- Okay. Can everyone hear me? Okay. Sorry for the delay; I had a bit of technical difficulty. Today was the first time I was trying to use my new Touch Bar MacBook Pro for presenting, and none of the adapters were working, so I had to switch laptops at the last minute. So, thanks. Sorry about that.

So, today is lecture 10. We're talking about recurrent neural networks. As usual, a couple of administrative notes. We're working hard on assignment one grading; those grades will probably be out sometime later today. Hopefully they can get out before the A2 deadline. That's what I'm hoping for. On a related note, assignment two is due today at 11:59 p.m. So, who's done with that already? About half of you guys. You'll remember I did warn you when the assignment went out that it was quite long and to start early, so you were warned about that. But hopefully you guys have some late days left.

Also, as another reminder, the midterm will be in class on Tuesday. If you look around the lecture hall, there are not enough seats in this room to seat all the enrolled students in the class, so we'll actually be holding the midterm in several other lecture halls across campus, and we'll be sending out more details on exactly where to go in the next couple of days.

Another bit of an announcement: we've been working on a fun bit of extra credit for you to play with that we're calling the training game. This is a cool browser-based experience where you can go in, interactively train neural networks, and tweak the hyperparameters during training. This should be a really cool interactive way for you to practice some of the hyperparameter tuning skills that we've been talking about over the last couple of lectures.
This is not required, but I think it will be a really useful experience for gaining a little more intuition into how some of these hyperparameters work for different types of datasets in practice. We're still working on getting all the bugs worked out of this setup, and we'll probably send out some more instructions on exactly how it will work in the next couple of days. Again, it's not required, but please do check it out. I think it'll be really fun and a really cool thing to play with, and we will give you a bit of extra credit if you end up working with it and doing a couple of runs. So we'll send out some more details about this soon, once we get all the bugs worked out.

As a reminder, last time we were talking about CNN architectures. We walked through the timeline of some of the various winners of the ImageNet classification challenge. The breakthrough result, as we saw, was the AlexNet architecture in 2012, an eight-layer convolutional network. It did amazingly well, kick-started this whole deep learning revolution in computer vision, and brought a lot of these models into the mainstream. Then we skipped ahead a couple of years and saw that in the 2014 ImageNet challenge we had these two really interesting models, VGG and GoogLeNet, which were much deeper. VGG had a 16- and a 19-layer model, and GoogLeNet was, I believe, a 22-layer model. One thing that is kind of interesting about these models is that the 2014 ImageNet challenge was right before batch normalization was invented. At that time, before the invention of batch normalization, training these relatively deep models of roughly twenty layers was very challenging.
In fact, both of these models had to resort to a little bit of hackery to get their deep models to converge. For VGG, they had the 16- and 19-layer models, but they actually first trained an 11-layer model, because that was what they could get to converge, then added some extra random layers in the middle and continued training the 16- and 19-layer models. Managing this training process was very challenging in 2014, before the invention of batch normalization. Similarly, for GoogLeNet, we saw that it has these auxiliary classifiers stuck into the lower layers of the network. These were not really needed to get good classification performance; they were just a way to inject extra gradient directly into the lower layers of the network. This, again, was before the invention of batch normalization, and once you have these networks with batch normalization, you no longer need these slightly ugly hacks to get the deeper models to converge.

Then in the 2015 ImageNet challenge we saw this really cool model called ResNet, these residual networks that have shortcut connections: in each little residual block, we take our input, pass it through the block's convolutional layers, and then add the input to the output of those convolutional layers. This is kind of a funny architecture, but it actually has two really nice properties. One is that if we just set all the weights in a residual block to zero, then that block is computing the identity. So, in some sense, it's relatively easy for this model to learn not to use the layers that it doesn't need.
In addition, it adds a nice interpretation to L2 regularization in the context of these networks. Remember, once you put L2 regularization on the weights of your network, it drives all the parameters towards zero. In a standard convolutional architecture, driving all the weights towards zero maybe doesn't make much sense. But in the context of a residual network, driving all the parameters towards zero encourages the model not to use layers it doesn't need, because it just drives those residual blocks towards the identity when they aren't needed for classification.

The other really useful property of these residual networks has to do with gradient flow in the backward pass. Remember what happens at these addition gates in the backward pass: when upstream gradient comes in through an addition gate, it splits and forks along the two different paths. So the upstream gradient takes one path through the convolutional blocks, but it also has a direct connection through the residual connection. When you imagine stacking many of these residual blocks on top of each other, so that the network ends up with potentially hundreds of layers, these residual connections give a sort of gradient superhighway for gradients to flow backward through the entire network. This allows the network to train much more easily and much faster, and actually allows these things to converge reasonably well even when the model is potentially hundreds of layers deep. This idea of managing gradient flow in your models is super important everywhere in machine learning, and it's super prevalent in recurrent networks as well, so we'll definitely revisit the idea of gradient flow later in today's lecture.
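As a minimal sketch of the two properties just described, here is a toy fully connected residual block in numpy. This is not the actual ResNet block, which uses convolutional layers with batch normalization; the function and variable names are purely illustrative.

```python
import numpy as np

def residual_block(x, W1, W2):
    # A toy fully connected "residual block": out = x + F(x).
    # If W1 and W2 are driven to zero (e.g. by L2 regularization),
    # F(x) becomes zero and the block computes the identity.
    f = np.maximum(0, x @ W1)   # first layer + ReLU
    f = f @ W2                  # second layer
    return x + f                # shortcut (addition) connection

# In the backward pass, the addition gate copies the upstream gradient
# onto both branches, so gradient reaches x directly through the
# shortcut regardless of what happens inside F.
```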
We also saw a couple of other more exotic, more recent CNN architectures last time, including DenseNet and FractalNet, and once you think about these architectures in terms of gradient flow, they make a little bit more sense. Things like DenseNet and FractalNet add additional shortcut or identity connections inside the model, and if you think about what happens in the backward pass for these models, these additional funny topologies are basically providing direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network. So again, this idea of managing gradient flow properly in your CNN architectures is something we've really seen a lot more of in the last couple of years, and we'll probably see more of it moving forward as more exotic architectures are invented.

We also saw this nice plot of performance versus the number of flops, the number of parameters, and the runtime of these various models, and there are some interesting characteristics you can dive in and see from that plot. One is that VGG and AlexNet have a huge number of parameters, and these parameters come almost entirely from the fully connected layers of the models. AlexNet has something like roughly 62 million parameters, and if you look at the first fully connected layer in AlexNet, it goes from an activation volume of six by six by 256 into a fully connected vector of 4096. So if you imagine what the weight matrix needs to look like at that layer, it's gigantic: its number of entries is six times six times 256 times 4096. If you multiply that out, you see that that single layer has 38 million parameters.
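As a quick sanity check on that figure, counting only the weights of that single layer (ignoring biases):

```python
# Weight count of AlexNet's large fully connected layer (ignoring biases):
fc_in = 6 * 6 * 256     # flattened 6x6x256 activation volume = 9,216 inputs
fc_out = 4096           # fully connected output units
print(fc_in * fc_out)   # 37,748,736, i.e. roughly 38 million parameters
```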
So more than half of the parameters of the entire AlexNet model are just sitting in that one fully connected layer. And if you add up all the parameters in the fully connected layers of AlexNet, including the other fully connected layers, you see that something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers. When we move to other architectures, like GoogLeNet and ResNet, they do away with a lot of these large fully connected layers in favor of global average pooling at the end of the network, and this allows these nicer architectures to really cut down their parameter counts.

So that was our brief recap of the CNN architectures that we saw last lecture. Today, we're going to move to one of my favorite topics to talk about, which is recurrent neural networks. So far in this class, we've seen what I like to think of as a vanilla feed-forward network: all of our network architectures have this flavor, where we receive some input, and that input is a fixed-size object, like an image or a vector. That input is fed through some set of hidden layers and produces a single output, like a set of classification scores over a set of categories. But in some contexts in machine learning, we want more flexibility in the types of data that our models can process. Once we move to this idea of recurrent neural networks, we have a lot more opportunities to play around with the types of input and output data that our networks can handle. Once we have recurrent neural networks, we can do what we call one-to-many models, where maybe our input is some object of fixed size, like an image, but now our output is a sequence of variable length, such as a caption.
Different captions might have different numbers of words, so our output needs to be variable in length. We also might have many-to-one models, where our input could be variably sized. This might be something like a piece of text, and we want to say what the sentiment of that text is, whether it's positive or negative. Or in a computer vision context, you might imagine taking a video as input, and that video might have a variable number of frames; now we want to read this entire video of potentially variable length and then, at the end, make a classification decision about what kind of activity or action is going on in that video. We might also have problems where we want both the input and the output to be variable in length. We might see something like this in machine translation, where our input is maybe a sentence in English, which could have a variable length, and our output is maybe a sentence in French, which could also have a variable length. And crucially, the length of the English sentence might be different from the length of the French sentence, so we need models that have the capacity to accept variable-length sequences on both the input and the output. Finally, we might also consider problems where our input is variable in length, like a video sequence with a variable number of frames, and now we want to make a decision for each element of that input sequence. In the context of videos, that might mean making some classification decision for every frame of the video. Recurrent neural networks are this kind of general paradigm for handling variable-sized sequence data that allows us to pretty naturally capture all of these different setups in our models.

Recurrent neural networks are actually important even for some problems that have a fixed-size input and a fixed-size output.
Even there, they can still be pretty useful. In this example, we might want to do sequential processing of our input. Here, we're receiving a fixed-size input like an image, and we want to make a classification decision about what number is shown in the image. But rather than just doing a single feed-forward pass and making the decision all at once, this network is actually looking around the image and taking a series of glimpses of different parts of it, and after that series of glimpses it makes its final decision about which number is present. So here, even though our input was an image and our output was a classification decision, this idea of being able to handle variable-length processing with recurrent neural networks can lead to some really interesting types of models.

There's a really cool paper that I like that applied this same type of idea to generating new images, where we want the model to synthesize brand-new images that look kind of like the images it saw in training, and we can use a recurrent neural network architecture to actually paint these output images one piece at a time. You can see that, even though our output is a fixed-size image, we can have models that work over time to compute parts of the output sequentially, one at a time; we can use recurrent neural networks for that type of setup as well.

So after this pitch about all the cool things that RNNs can do, you might wonder what exactly these things are. In general, a recurrent neural network has this little recurrent core cell: it will take some input x, feed that input into the RNN, and the RNN has some internal hidden state, and that internal hidden state will be updated every time the RNN reads a new input.
That internal hidden state will then be fed back into the model the next time it reads an input. And frequently we will want our RNNs to also produce some output at every time step, so we'll have this pattern where the network reads an input, updates its hidden state, and then produces an output.

So then the question is: what is the functional form of this recurrence relation that we're computing? Inside this little green RNN block, we're computing some recurrence relation with a function f. This function f depends on some weights W. It accepts the previous hidden state, h_{t-1}, as well as the input at the current time step, x_t, and it outputs the next hidden state, or the updated hidden state, which we call h_t. Then as we read the next input, this new hidden state h_t will just be passed into the same function along with the next input, x_{t+1}. If we wanted to produce some output at every time step of this network, we might attach some additional fully connected layers that read in this h_t at every time step and make that decision based on the hidden state. One thing to note is that we use the same function f_W and the same weights W at every time step of the computation.

The simplest functional form that you can imagine is what we call the vanilla recurrent neural network. Here we have the same functional form from the previous slide, where we take in our previous hidden state and our current input and need to produce the next hidden state. The simplest thing you might imagine is that we have some weight matrix W_xh that we multiply against the input x_t, as well as another weight matrix W_hh that we multiply against the previous hidden state.
We make these two multiplications, add them together, and squash the result through a tanh, so we get some kind of nonlinearity in the system. You might be wondering why we use a tanh here and not some other type of nonlinearity, after all the negative things we've said about tanh in previous lectures; I think we'll return to that a little later on when we talk about more advanced architectures like the LSTM. Then, in addition, in this architecture, if we wanted to produce some y_t at every time step, you might have another weight matrix that accepts the hidden state and transforms it into some y, to produce maybe some class score predictions at every time step.

You can kind of think of recurrent neural networks in two ways. One is this concept of having a hidden state that feeds back into itself, recurrently, but I find that picture a little bit confusing. Sometimes I find it clearer to think about unrolling this computational graph over multiple time steps, and this makes the data flow of the hidden states, the inputs, the outputs, and the weights a little bit more clear. At the first time step, we'll have some initial hidden state h_0, which is usually initialized to zeros in most contexts, and we'll have some input x_1. This initial hidden state h_0 and our first input x_1 go into our f_W function, and this produces our next hidden state, h_1. Then we repeat this process when we receive the next input: now the current hidden state h_1 and the next input x_2 go into that same f_W to produce the next hidden state, h_2.
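As a minimal numpy sketch of this vanilla RNN step: the weight names follow the lecture's W_xh and W_hh, while the output matrix W_hy, the sizes, and the toy unrolling loop below are illustrative assumptions rather than the lecture's exact code.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # Vanilla RNN recurrence: h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # Optional per-time-step output (e.g. unnormalized class scores)
    y_t = W_hy @ h_t
    return h_t, y_t

# Unrolling over a sequence: the SAME weights are reused at every step.
D, H, V = 10, 32, 4                     # input, hidden, output sizes (illustrative)
W_xh = np.random.randn(H, D) * 0.01
W_hh = np.random.randn(H, H) * 0.01
W_hy = np.random.randn(V, H) * 0.01
h = np.zeros(H)                         # h_0 is usually initialized to zeros
for x_t in np.random.randn(5, D):       # a toy sequence of 5 input vectors
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)
```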
This process repeats over and over again as we consume all of the inputs x_t in our sequence. One thing to note is that we can make this even more explicit and write the W matrix in our computational graph, and here you can see that we're reusing the same W matrix at every time step of the computation. So every one of these little f_W blocks receives a unique h and a unique x, but all of the blocks take the same W. And remember, when we talked about how gradient flows in backpropagation: when you reuse the same node multiple times in a computational graph, then during the backward pass you end up summing the gradients into the W matrix when you're computing dLoss/dW. So if you think about backpropagation for this model, you'll have a separate gradient for W flowing from each of those time steps, and the final gradient for W will be the sum of all of those individual per-time-step gradients.

We can also write the outputs y_t explicitly in this computational graph. The hidden state h_t at every time step might feed into some other little neural network that produces a y_t, which might be some class scores, or something like that, at every time step. We can also make the loss more explicit. In many cases, you might imagine that you have some ground truth label at every time step of your sequence, and then you'll compute some individual loss at every time step on these outputs y_t. This loss will frequently be something like a softmax loss, in the case where you have a ground truth label at every time step of the sequence.
Now the final loss for this entire training step will be the sum of these individual losses. We have a scalar loss at every time step, and we just sum them up to get our final scalar loss at the top of the network. If you think again about backpropagation through this thing: in order to train the model, we need to compute the gradient of the loss with respect to W. So we'll have gradient flowing from that final loss into each of these time steps, and then each of those time steps will compute a local gradient on the weights W, which will all be summed to give us our final gradient for the weights W.

Now, if we have this many-to-one situation, where maybe we want to do something like sentiment analysis, then we would typically make that decision based on the final hidden state of the network, because this final hidden state kind of summarizes all of the context from the entire sequence. Also, if we have a one-to-many situation, where we want to receive a fixed-size input and then produce a variably sized output, then you'll commonly use that fixed-size input to somehow initialize the initial hidden state of the model, and now the recurrent network will tick for each cell in the output; as you produce your variably sized output, you'll unroll the graph for each element of the output.

When we talk about sequence-to-sequence models, where you might do something like machine translation, taking a variably sized input to a variably sized output, you can think of this as a combination of the many-to-one plus the one-to-many. So we'll proceed in two stages, what we call an encoder and a decoder.
For the encoder, we'll receive the variably sized input, which might be your sentence in English, and then summarize that entire sentence using the final hidden state of the encoder network. Now we're in a many-to-one situation where we've summarized this entire variably sized input in a single vector, and we have a second decoder network, which is a one-to-many situation, that takes in that single vector summarizing the input sentence and produces a variably sized output, which might be your sentence in another language. In this variably sized output, we might make some predictions at every time step, maybe about what word to use. And you can imagine training this entire thing by unrolling the computational graph, summing the losses over the output sequence, and just performing backpropagation as usual.

As a bit of a concrete example, one thing that we frequently use recurrent neural networks for is this problem called language modeling. In the language modeling problem, we want our network to understand, in some sense, how to produce natural language. This might happen at the character level, where our model produces characters one at a time, or at the word level, where our model produces words one at a time. In a very simple example, you can imagine a character-level language model where the network reads some sequence of characters and then needs to predict what the next character will be in the stream of text. In this example, we have a very small vocabulary of four letters, h, e, l, and o, and we have this example training sequence of the word "hello": h, e, l, l, o.
During training, when we're training this language model, we will feed the characters of this training sequence as inputs; these are the x_t's that we feed in as the inputs to our recurrent neural network. Each of these inputs is a letter, and we need to figure out a way to represent letters in our network. What we'll typically do is figure out what our total vocabulary is; in this case, our vocabulary has four elements. Each letter will be represented by a vector that has zeros in every slot but one, and a one in the slot of the vocabulary corresponding to that letter. In this little example, since our vocabulary has the four letters h, e, l, o, the h in our input sequence is represented by a four-element vector with a one in the first slot and zeros in the other three slots, and we use the same sort of pattern to represent all the different letters in the input sequence.

Now, during the forward pass, at the first time step the network receives the input letter h. That goes into the first RNN cell, and then we produce this output y_t, which is the network making predictions, for each letter in the vocabulary, about which letter it thinks is most likely to come next. In this example, the correct output letter was e, because our training sequence was "hello", but the model is, I think, actually predicting o as the most likely letter. So in this case the prediction was wrong, and we would use a softmax loss to quantify our unhappiness with these predictions. At the next time step, we would feed in the second letter of the training sequence, e, and this process repeats.
We'll now represent e as a vector, use that input vector together with the previous hidden state to produce a new hidden state, and then use the second hidden state to again make predictions over every letter in the vocabulary. In this case, because our training sequence was "hello", after the letter e we want our model to predict l. Here our model may assign very low scores to the letter l, so we would incur a high loss. You repeat this process over and over, and if you train this model with many different sequences, then eventually it should learn how to predict the next character in a sequence based on the context of all the previous characters it has seen before.

Now, if you think about what happens at test time, after we train this model, one thing we might want to do with it is sample from the model, and actually use this trained neural network to synthesize new text that looks similar in spirit to the text it was trained on. The way this works is that we'll typically seed the model with some input prefix of text. In this case, the prefix is just the single letter h, and we'll feed that letter h through the first time step of our recurrent neural network. It will produce a distribution of scores over all the characters in the vocabulary. At test time, we'll actually sample from these scores: we'll use a softmax function to convert the scores into a probability distribution, and then we sample from that probability distribution to synthesize the second letter in the sequence. In this case, even though the scores were pretty bad, maybe we got lucky and sampled the letter e from this probability distribution. Now we'll take this letter e that was sampled from the distribution and feed it back as input into the network at the next time step.
We take this e, pull it down from the top, feed it back into the network as one of these one-hot vector representations, and then repeat the process in order to synthesize the next letter in the output. We can repeat this process over and over again to synthesize a new sequence using this trained model, where we're synthesizing the sequence one character at a time using these predicted probability distributions at each time step. Question?

Yeah, that's a great question. The question is: why might we sample instead of just taking the character with the largest score? In this case, because of the probability distribution that we had, it was impossible to get the right character, so we had to sample so the example could work out and make sense. But in practice, sometimes you'll see both. Sometimes you'll just take the argmax probability, and that will sometimes be a little bit more stable, but one advantage of sampling in general is that it lets you get diversity from your models. Sometimes you might have the same input, maybe the same prefix, or in the case of image captioning maybe the same image, but if you sample rather than taking the argmax, you'll see that these trained models are sometimes able to produce multiple different reasonable output sequences, depending on which samples they take at the first time steps. It's actually kind of a benefit, because we can now get more diversity in our outputs. Another question?

Could we feed in the softmax vector instead of the one-hot vector? You mean at test time? Yeah, so the question is: at test time, could we feed in this whole softmax vector rather than a one-hot vector? There are two problems with that. One is that it's very different from the data the model saw at training time.
In general, if you ask your model to do something at test time that is different from training time, it'll usually blow up: it'll usually give you garbage and you'll usually be sad. The other problem is that in practice our vocabularies might be very large. In this simple example, our vocabulary is only four elements, so it's not a big problem, but if you're thinking about generating words one at a time, now your vocabulary is every word in the English language, which could be something like tens of thousands of elements. So in practice, this first operation that takes in the one-hot vector is often performed using sparse vector operations rather than dense vectors. It would be computationally really bad if you wanted to feed in a dense 10,000-element softmax vector. So that's usually why we use a one-hot vector instead, even at test time.

This idea that we have a sequence, produce an output at every time step of the sequence, and then finally compute some loss is sometimes called backpropagation through time, because you're imagining that in the forward pass you're stepping forward through time, and then during the backward pass you're going backwards through time to compute all your gradients. This can actually be kind of problematic if you want to train on sequences that are very, very long. If you imagine that we were trying to train a neural network language model on maybe the entire text of Wikipedia, which is, by the way, something that people do pretty frequently, this would be super slow: every time we made a gradient step, we would have to make a forward pass through the entire text of Wikipedia, then make a backward pass through all of Wikipedia, and only then make a single gradient update. That would be super slow, and your model would never converge.
808 00:28:29,984 --> 00:28:31,853 It would also take a ridiculous amount of memory 809 00:28:31,853 --> 00:28:34,172 so this would be just really bad. 810 00:28:34,172 --> 00:28:36,384 In practice, what people do is this, sort of, 811 00:28:36,384 --> 00:28:39,933 approximation called truncated backpropagation through time. 812 00:28:39,933 --> 00:28:41,522 Here, the idea is that, 813 00:28:41,522 --> 00:28:43,973 even though our input sequence is very, very long, 814 00:28:43,973 --> 00:28:45,562 and even potentially infinite, 815 00:28:45,562 --> 00:28:47,184 what we'll do is that during, 816 00:28:47,184 --> 00:28:48,521 when we're training the model, 817 00:28:48,521 --> 00:28:50,544 we'll step forward for some number of steps, 818 00:28:50,544 --> 00:28:54,402 maybe like a hundred is kind of a ballpark number 819 00:28:54,402 --> 00:28:56,232 that people frequently use, 820 00:28:56,232 --> 00:28:58,623 and we'll step forward for maybe a hundred steps, 821 00:28:58,623 --> 00:29:01,944 compute a loss only over this sub sequence of the data, 822 00:29:01,944 --> 00:29:04,312 and then back propagate through this sub sequence, 823 00:29:04,312 --> 00:29:06,261 and now make a gradient step. 824 00:29:06,261 --> 00:29:08,102 And now, when we repeat, well, 825 00:29:08,102 --> 00:29:10,072 we still have these hidden states 826 00:29:10,072 --> 00:29:12,064 that we computed from the first batch, 827 00:29:12,064 --> 00:29:14,664 and now, when we compute this next batch of data, 828 00:29:14,664 --> 00:29:17,933 we will carry those hidden states forward in time, 829 00:29:17,933 --> 00:29:20,631 so the forward pass will be exactly the same. 830 00:29:20,631 --> 00:29:23,184 But now when we compute a gradient step 831 00:29:23,184 --> 00:29:24,392 for this next batch of data, 832 00:29:24,392 --> 00:29:28,124 we will only backpropagate again through this second batch. 833 00:29:28,124 --> 00:29:29,659 Now, we'll make a gradient step 834 00:29:29,659 --> 00:29:32,760 based on this truncated backpropagation through time. 835 00:29:32,760 --> 00:29:34,392 This process will continue, 836 00:29:34,392 --> 00:29:36,440 where now when we make the next batch, 837 00:29:36,440 --> 00:29:38,250 we'll again copy these hidden states forward, 838 00:29:38,250 --> 00:29:40,930 but then step forward and then step backward, 839 00:29:40,930 --> 00:29:43,840 but only for some small number of time steps. 840 00:29:43,840 --> 00:29:44,673 So this is, 841 00:29:44,673 --> 00:29:45,600 you can kind of think of this 842 00:29:45,600 --> 00:29:48,250 as being analogous to stochastic gradient descent 843 00:29:48,250 --> 00:29:49,872 in the case of sequences. 844 00:29:49,872 --> 00:29:52,280 Remember, when we talked about training our models 845 00:29:52,280 --> 00:29:53,690 on large data sets, 846 00:29:53,690 --> 00:29:54,672 then these data sets, 847 00:29:54,672 --> 00:29:56,880 it would be super expensive to compute the gradients 848 00:29:56,880 --> 00:29:58,720 over every element in the data set. 849 00:29:58,720 --> 00:30:00,861 So instead, we kind of take small samples, 850 00:30:00,861 --> 00:30:02,520 small mini batches instead, 851 00:30:02,520 --> 00:30:05,290 and use mini batches of data to compute gradient steps 852 00:30:05,290 --> 00:30:08,053 in, kind of, the image classification case. 853 00:30:08,053 --> 00:30:08,886 Question? 854 00:30:12,441 --> 00:30:14,415 Is this kind of, the question is, 855 00:30:14,415 --> 00:30:15,989 is this kind of making the Markov assumption?
856 00:30:15,989 --> 00:30:17,239 No, not really. 857 00:30:18,100 --> 00:30:20,001 Because we're carrying this hidden state forward 858 00:30:20,001 --> 00:30:21,442 in time forever. 859 00:30:21,442 --> 00:30:23,512 It's making a Markovian assumption 860 00:30:23,512 --> 00:30:25,792 in the sense that, conditioned on the hidden state, 861 00:30:25,792 --> 00:30:27,733 the hidden state is all that we need 862 00:30:27,733 --> 00:30:28,971 to predict the entire future 863 00:30:28,971 --> 00:30:31,101 of the sequence. 864 00:30:31,101 --> 00:30:33,019 But that assumption is kind of built 865 00:30:33,019 --> 00:30:35,012 into the recurrent neural network formula 866 00:30:35,012 --> 00:30:35,941 from the start. 867 00:30:35,941 --> 00:30:37,269 And that's not really particular 868 00:30:37,269 --> 00:30:39,032 to back propagation through time. 869 00:30:39,032 --> 00:30:40,372 Back propagation through time, 870 00:30:40,372 --> 00:30:43,061 or sorry, truncated back prop through time 871 00:30:43,061 --> 00:30:44,352 is just the way to approximate these gradients 872 00:30:44,352 --> 00:30:47,072 without making a backwards pass 873 00:30:47,072 --> 00:30:51,239 through your potentially very large sequence of data. 874 00:30:52,677 --> 00:30:55,301 This all sounds very complicated and confusing 875 00:30:55,301 --> 00:30:56,871 and it sounds like a lot of code to write, 876 00:30:56,871 --> 00:30:59,649 but in fact, this can actually be pretty concise. 877 00:30:59,649 --> 00:31:03,418 Andrej has this example of what he calls min-char-rnn, 878 00:31:03,418 --> 00:31:04,816 that does all of this stuff 879 00:31:04,816 --> 00:31:07,474 in just like 112 lines of Python. 880 00:31:07,474 --> 00:31:09,045 It handles building the vocabulary. 881 00:31:09,045 --> 00:31:09,925 It trains the model 882 00:31:09,925 --> 00:31:11,725 with truncated back propagation through time. 883 00:31:11,725 --> 00:31:13,936 And then, it can actually sample from that model 884 00:31:13,936 --> 00:31:16,584 in actually not too much code. 885 00:31:16,584 --> 00:31:17,925 So even though this sounds like 886 00:31:17,925 --> 00:31:18,965 kind of a big, scary process, 887 00:31:18,965 --> 00:31:20,862 it's actually not too difficult. 888 00:31:20,862 --> 00:31:22,314 I'd encourage you, if you're confused, 889 00:31:22,314 --> 00:31:23,456 to maybe go check this out 890 00:31:23,456 --> 00:31:25,125 and step through the code on your own time, 891 00:31:25,125 --> 00:31:27,074 and see, kind of, all of these concrete steps 892 00:31:27,074 --> 00:31:27,954 happening in code. 893 00:31:27,954 --> 00:31:30,184 So this is all in just a single file, 894 00:31:30,184 --> 00:31:31,723 all using numpy with no dependencies. 895 00:31:31,723 --> 00:31:34,473 It's relatively easy to read. 896 00:31:35,584 --> 00:31:36,896 So then, once we have this idea 897 00:31:36,896 --> 00:31:39,056 of training a recurrent neural network language model, 898 00:31:39,056 --> 00:31:41,593 we can actually have a lot of fun with this. 899 00:31:41,593 --> 00:31:43,984 And we can take in, sort of, any text that we want. 900 00:31:43,984 --> 00:31:46,424 Take in, like, whatever random text you can think of 901 00:31:46,424 --> 00:31:47,384 from the internet, 902 00:31:47,384 --> 00:31:49,656 train our recurrent neural network language model 903 00:31:49,656 --> 00:31:50,616 on this text, 904 00:31:50,616 --> 00:31:52,304 and then generate new text.
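A minimal numpy sketch of the test-time sampling loop described above, in the spirit of min-char-rnn (the sizes, the random weights, and the chunking helpers in the comment at the bottom are illustrative assumptions, not taken from the actual file):

    import numpy as np

    vocab_size, hidden_size = 4, 100
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to character scores
    bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))

    def sample(h, seed_ix, n):
        # Generate n characters, feeding each sampled character back in as a one-hot vector.
        x = np.zeros((vocab_size, 1))
        x[seed_ix] = 1
        ixes = []
        for _ in range(n):
            h = np.tanh(Wxh @ x + Whh @ h + bh)              # vanilla RNN recurrence
            y = Why @ h + by                                 # scores over the vocabulary
            p = np.exp(y) / np.sum(np.exp(y))                # softmax probabilities
            ix = np.random.choice(vocab_size, p=p.ravel())   # sample (instead of argmax)
            x = np.zeros((vocab_size, 1))
            x[ix] = 1                                        # one-hot, fed back at the next step
            ixes.append(ix)
        return ixes

    print(sample(np.zeros((hidden_size, 1)), seed_ix=0, n=20))

    # Truncated backpropagation through time, schematically: walk over the text in
    # chunks of roughly 100 characters, carry the hidden state forward between chunks,
    # but only run the forward and backward pass within the current chunk.
    #
    #   hprev = np.zeros((hidden_size, 1))
    #   for inputs, targets in chunks_of(text, 100):              # chunks_of is hypothetical
    #       loss, grads, hprev = forward_backward(inputs, targets, hprev)
    #       apply_update(grads)                                   # e.g. Adagrad in min-char-rnn

Taking the argmax of p instead of calling np.random.choice gives the more deterministic behavior mentioned in the earlier answer, at the cost of diversity.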
905 00:31:52,304 --> 00:31:55,502 So in this example, we took this entire text 906 00:31:55,502 --> 00:31:57,394 of all of Shakespeare's works, 907 00:31:57,394 --> 00:31:59,314 and then used that to train 908 00:31:59,314 --> 00:32:00,885 a recurrent neural network language model 909 00:32:00,885 --> 00:32:02,634 on all of Shakespeare. 910 00:32:02,634 --> 00:32:03,896 And you can see that at the beginning of training, 911 00:32:03,896 --> 00:32:05,773 it's kind of producing maybe random gibberish garbage, 912 00:32:05,773 --> 00:32:08,216 but throughout the course of training, 913 00:32:08,216 --> 00:32:11,584 it ends up producing things that seem relatively reasonable. 914 00:32:11,584 --> 00:32:12,784 And after you've, 915 00:32:12,784 --> 00:32:14,272 after this model has been trained pretty well, 916 00:32:14,272 --> 00:32:16,104 then it produces text that seems, 917 00:32:16,104 --> 00:32:18,224 kind of, Shakespeare-esque to me. 918 00:32:18,224 --> 00:32:20,264 "Why do what that day," replied, 919 00:32:20,264 --> 00:32:23,184 whatever, right, you can read this. 920 00:32:23,184 --> 00:32:24,754 Like, it kind of looks kind of like Shakespeare. 921 00:32:24,754 --> 00:32:27,184 And if you actually train this model even more, 922 00:32:27,184 --> 00:32:28,765 and let it converge even further, 923 00:32:28,765 --> 00:32:31,264 and then sample these even longer sequences, 924 00:32:31,264 --> 00:32:33,904 you can see that it learns all kinds of crazy cool stuff 925 00:32:33,904 --> 00:32:35,864 that really looks like a Shakespeare play. 926 00:32:35,864 --> 00:32:38,856 It knows that it uses, maybe, these headings 927 00:32:38,856 --> 00:32:40,016 to say who's speaking. 928 00:32:40,016 --> 00:32:42,224 Then it produces these bits of text 929 00:32:42,224 --> 00:32:43,736 that have crazy dialogue 930 00:32:43,736 --> 00:32:45,565 that sounds kind of Shakespeare-esque. 931 00:32:45,565 --> 00:32:46,714 It knows to put line breaks 932 00:32:46,714 --> 00:32:47,744 in between these different things. 933 00:32:47,744 --> 00:32:49,342 And this is all, like, really cool, 934 00:32:49,342 --> 00:32:52,958 all just sort of learned from the structure of the data. 935 00:32:52,958 --> 00:32:55,536 We can actually get even crazier than this. 936 00:32:55,536 --> 00:32:58,368 This was one of my favorite examples. 937 00:32:58,368 --> 00:33:00,443 I found online, there's this. 938 00:33:00,443 --> 00:33:02,725 Is anyone a mathematician in this room? 939 00:33:02,725 --> 00:33:05,914 Has anyone taken an algebraic topology course by any chance? 940 00:33:05,914 --> 00:33:07,325 Wow, a couple, that's impressive. 941 00:33:07,325 --> 00:33:10,765 So you probably know more algebraic topology than me, 942 00:33:10,765 --> 00:33:12,165 but I found this open source 943 00:33:12,165 --> 00:33:15,114 algebraic topology textbook online. 944 00:33:15,114 --> 00:33:16,514 It's just a whole bunch of TeX files 945 00:33:16,514 --> 00:33:19,554 that are like this super dense mathematics. 946 00:33:19,554 --> 00:33:22,593 And LaTeX, 'cause LaTeX is sort of this, 947 00:33:22,593 --> 00:33:24,774 lets you write equations and diagrams 948 00:33:24,774 --> 00:33:26,325 and everything just using plain text. 949 00:33:26,325 --> 00:33:27,455 We can actually train our 950 00:33:27,455 --> 00:33:28,816 recurrent neural network language model 951 00:33:28,816 --> 00:33:31,457 on the raw LaTeX source code 952 00:33:31,457 --> 00:33:33,146 of this algebraic topology textbook.
953 00:33:33,146 --> 00:33:37,687 And if we do that, then after we sample from the model, 954 00:33:37,687 --> 00:33:39,208 then we get something that seems like, 955 00:33:39,208 --> 00:33:41,446 kind of like algebraic topology. 956 00:33:41,446 --> 00:33:43,797 So it knows to like put equations. 957 00:33:43,797 --> 00:33:46,679 It puts all kinds of crazy stuff. 958 00:33:46,679 --> 00:33:48,717 It's like, to prove study, 959 00:33:48,717 --> 00:33:51,574 we see that F sub U is a covering of x prime, 960 00:33:51,574 --> 00:33:52,773 blah, blah, blah, blah, blah. 961 00:33:52,773 --> 00:33:53,992 It knows where to put unions. 962 00:33:53,992 --> 00:33:56,576 It knows to put squares at the end of proofs. 963 00:33:56,576 --> 00:33:57,528 It makes lemmas. 964 00:33:57,528 --> 00:34:00,026 It makes references to previous lemmas. 965 00:34:00,026 --> 00:34:01,225 Right, like we hear, like. 966 00:34:01,225 --> 00:34:03,781 It's namely a bi-lemma question. 967 00:34:03,781 --> 00:34:05,917 We see that R is geometrically something. 968 00:34:05,917 --> 00:34:08,417 So it's actually pretty crazy. 969 00:34:09,496 --> 00:34:12,913 It also sometimes tries to make diagrams. 970 00:34:14,239 --> 00:34:16,132 For those of you that have taken algebraic topology, 971 00:34:16,132 --> 00:34:17,322 you know that these commutative diagrams 972 00:34:17,322 --> 00:34:19,313 are kind of a thing that you work with a lot 973 00:34:19,313 --> 00:34:21,153 So it kind of got the general gist 974 00:34:21,154 --> 00:34:22,353 of how to make those diagrams, 975 00:34:22,353 --> 00:34:26,379 but they actually don't make any sense. 976 00:34:26,380 --> 00:34:28,130 And actually, 977 00:34:28,130 --> 00:34:29,440 one of my favorite examples here 978 00:34:29,440 --> 00:34:32,728 is that it sometimes omits proofs. 979 00:34:32,728 --> 00:34:33,788 So it'll sometimes say, 980 00:34:33,789 --> 00:34:35,418 it'll sometimes say something like 981 00:34:35,418 --> 00:34:39,695 theorem, blah, blah, blah, blah, blah, proof omitted. 982 00:34:39,695 --> 00:34:41,559 This thing kind of has gotten the gist 983 00:34:41,559 --> 00:34:45,392 of how some of these math textbooks look like. 984 00:34:47,831 --> 00:34:49,122 We can have a lot of fun with this. 985 00:34:49,123 --> 00:34:50,792 So we also tried training one of these models 986 00:34:50,792 --> 00:34:53,498 on the entire source code of the Linux kernel. 987 00:34:53,498 --> 00:34:55,791 'Cause again, this character level stuff 988 00:34:55,792 --> 00:34:56,801 that we can train on, 989 00:34:56,801 --> 00:34:58,630 And then, when we sample this, 990 00:34:58,630 --> 00:35:01,483 it acutally again looks like C source code. 991 00:35:01,483 --> 00:35:03,534 It knows how to write if statements. 992 00:35:03,534 --> 00:35:06,192 It has, like, pretty good code formatting skills. 993 00:35:06,192 --> 00:35:08,110 It knows to indent after these if statements. 994 00:35:08,110 --> 00:35:09,552 It knows to put curly braces. 995 00:35:09,552 --> 00:35:12,289 It actually even makes comments about some things 996 00:35:12,289 --> 00:35:15,052 that are usually nonsense. 997 00:35:15,052 --> 00:35:18,089 One problem with this model is that 998 00:35:18,089 --> 00:35:19,842 it knows how to declare variables. 999 00:35:19,842 --> 00:35:23,355 But it doesn't always use the variables that it declares. 1000 00:35:23,355 --> 00:35:24,414 And sometimes it tries 1001 00:35:24,414 --> 00:35:26,134 to use variables that haven't been declared. 
1002 00:35:26,134 --> 00:35:27,554 This wouldn't compile. 1003 00:35:27,554 --> 00:35:28,814 I would not recommend sending this 1004 00:35:28,814 --> 00:35:30,555 as a pull request to Linux. 1005 00:35:30,555 --> 00:35:34,606 This thing also figures out how to recite the GNU, 1006 00:35:34,606 --> 00:35:37,867 this GNU license character by character. 1007 00:35:37,867 --> 00:35:40,454 It kind of knows that you need to recite the GNU license 1008 00:35:40,454 --> 00:35:42,696 and after the license comes some includes, 1009 00:35:42,696 --> 00:35:44,962 then some other includes, then source code. 1010 00:35:44,962 --> 00:35:46,864 This thing has actually learned quite a lot 1011 00:35:46,864 --> 00:35:48,836 about the general structure of the data. 1012 00:35:48,836 --> 00:35:50,096 Where, again, during training, 1013 00:35:50,096 --> 00:35:51,518 all we asked this model to do 1014 00:35:51,518 --> 00:35:53,398 was try to predict the next character in the sequence. 1015 00:35:53,398 --> 00:35:55,406 We didn't tell it any of this structure, 1016 00:35:55,406 --> 00:35:57,534 but somehow, just through the course 1017 00:35:57,534 --> 00:35:58,856 of this training process, 1018 00:35:58,856 --> 00:36:01,438 it learned a lot about the latent structure 1019 00:36:01,438 --> 00:36:03,355 in the sequential data. 1020 00:36:05,808 --> 00:36:07,406 Yeah, so it knows how to write code. 1021 00:36:07,406 --> 00:36:10,246 It does a lot of cool stuff. 1022 00:36:10,246 --> 00:36:13,575 I had this paper with Andrej a couple years ago 1023 00:36:13,575 --> 00:36:14,966 where we trained a bunch of these models 1024 00:36:14,966 --> 00:36:17,176 and then we wanted to try to poke into the brains 1025 00:36:17,176 --> 00:36:18,009 of these models 1026 00:36:18,009 --> 00:36:19,376 and figure out like what are they doing 1027 00:36:19,376 --> 00:36:20,856 and why are they working. 1028 00:36:20,856 --> 00:36:22,027 So we saw, in our, 1029 00:36:22,027 --> 00:36:23,386 these recurrent neural networks 1030 00:36:23,386 --> 00:36:24,947 have this hidden vector which is, 1031 00:36:24,947 --> 00:36:28,165 maybe, some vector that's updated over every time step. 1032 00:36:28,165 --> 00:36:29,838 And then what we wanted to try to figure out is 1033 00:36:29,838 --> 00:36:32,158 could we find some elements of this vector 1034 00:36:32,158 --> 00:36:34,176 that have some semantic, interpretable meaning. 1035 00:36:34,176 --> 00:36:35,336 So what we did is 1036 00:36:35,336 --> 00:36:37,096 we trained a neural network language model, 1037 00:36:37,096 --> 00:36:38,507 one of these character level models 1038 00:36:38,507 --> 00:36:39,917 on one of these data sets, 1039 00:36:39,917 --> 00:36:42,965 and then we picked one of the elements in that hidden vector 1040 00:36:42,965 --> 00:36:45,406 and now we look at what is the value of that hidden vector 1041 00:36:45,406 --> 00:36:47,923 over the course of a sequence 1042 00:36:47,923 --> 00:36:49,926 to try to get some sense of maybe 1043 00:36:49,926 --> 00:36:52,206 what these different hidden states are looking for. 1044 00:36:52,206 --> 00:36:54,086 When you do this, a lot of them end up looking 1045 00:36:54,086 --> 00:36:56,406 kind of like random gibberish garbage.
1046 00:36:56,406 --> 00:36:57,736 So here again, what we've done, 1047 00:36:57,736 --> 00:36:59,806 is we've picked one element of that vector, 1048 00:36:59,806 --> 00:37:01,467 and now we run the sequence forward 1049 00:37:01,467 --> 00:37:02,686 through the trained model, 1050 00:37:02,686 --> 00:37:04,056 and now the color of each character 1051 00:37:04,056 --> 00:37:07,271 corresponds to the magnitude of that single 1052 00:37:07,271 --> 00:37:09,446 scaler element of the hidden vector at every time step 1053 00:37:09,446 --> 00:37:10,876 when it's reading the sequence. 1054 00:37:10,876 --> 00:37:12,006 So you can see that a lot 1055 00:37:12,006 --> 00:37:13,958 of the vectors in these hidden states 1056 00:37:13,958 --> 00:37:15,883 are kind of not very interpretable. 1057 00:37:15,883 --> 00:37:17,404 It seems like they're kind of doing some of this 1058 00:37:17,404 --> 00:37:19,158 low level language modeling 1059 00:37:19,158 --> 00:37:21,396 to figure out what character should come next. 1060 00:37:21,396 --> 00:37:23,438 But some of them end up quite nice. 1061 00:37:23,438 --> 00:37:26,375 So here we found this vector that is looking for quotes. 1062 00:37:26,375 --> 00:37:28,507 You can see that there's this one hidden element, 1063 00:37:28,507 --> 00:37:30,318 this one element in the vector, 1064 00:37:30,318 --> 00:37:32,486 that is off, off, off, off, off blue 1065 00:37:32,486 --> 00:37:33,827 and then once it hits a quote, 1066 00:37:33,827 --> 00:37:37,136 it turns on and remains on for the duration 1067 00:37:37,136 --> 00:37:38,303 of this quote. 1068 00:37:39,187 --> 00:37:41,025 And now when we hit the second quotation mark, 1069 00:37:41,025 --> 00:37:42,598 then that cell turns off. 1070 00:37:42,598 --> 00:37:44,958 So somehow, even though this model was only trained 1071 00:37:44,958 --> 00:37:46,606 to predict the next character in a sequence, 1072 00:37:46,606 --> 00:37:48,856 it somehow learned that a useful thing, 1073 00:37:48,856 --> 00:37:49,976 in order to do this, 1074 00:37:49,976 --> 00:37:54,222 might be to have some cell that's trying to detect quotes. 1075 00:37:54,222 --> 00:37:55,507 We also found this other cell 1076 00:37:55,507 --> 00:37:58,956 that is, looks like it's counting the number of characters 1077 00:37:58,956 --> 00:38:00,104 since a line break. 1078 00:38:00,104 --> 00:38:01,507 So you can see that at the beginning of each line, 1079 00:38:01,507 --> 00:38:04,176 this element starts off at zero. 1080 00:38:04,176 --> 00:38:05,536 Throughout the course of the line, 1081 00:38:05,536 --> 00:38:07,055 it's gradually more red, 1082 00:38:07,055 --> 00:38:08,486 so that value increases. 1083 00:38:08,486 --> 00:38:10,078 And then after the new line character, 1084 00:38:10,078 --> 00:38:11,976 it resets to zero. 1085 00:38:11,976 --> 00:38:13,356 So you can imagine that maybe this cell 1086 00:38:13,356 --> 00:38:15,336 is letting the network keep track 1087 00:38:15,336 --> 00:38:16,966 of when it needs to write 1088 00:38:16,966 --> 00:38:19,545 to produce these new line characters. 1089 00:38:19,545 --> 00:38:20,936 We also found some that, 1090 00:38:20,936 --> 00:38:22,987 when we trained on the linux source code, 1091 00:38:22,987 --> 00:38:25,307 we found some examples that are turning on 1092 00:38:25,307 --> 00:38:27,158 inside the conditions of if statements. 
1093 00:38:27,158 --> 00:38:29,216 So this maybe allows the network 1094 00:38:29,216 --> 00:38:32,364 to differentiate whether it's outside an if statement 1095 00:38:32,364 --> 00:38:33,838 or inside that condition, 1096 00:38:33,838 --> 00:38:35,806 which might help it model these sequences better. 1097 00:38:35,806 --> 00:38:38,976 We also found some that turn on in comments, 1098 00:38:38,976 --> 00:38:41,467 or some that seem like they're counting 1099 00:38:41,467 --> 00:38:44,765 the number of indentation levels. 1100 00:38:44,765 --> 00:38:46,298 This is all just really cool stuff 1101 00:38:46,298 --> 00:38:47,488 because it's saying that 1102 00:38:47,488 --> 00:38:49,630 even though we are only trying to train this model 1103 00:38:49,630 --> 00:38:50,758 to predict next characters, 1104 00:38:50,758 --> 00:38:52,229 it somehow ends up learning a lot 1105 00:38:52,229 --> 00:38:55,646 of useful structure about the input data. 1106 00:38:57,161 --> 00:38:59,307 One kind of thing that we often use, 1107 00:38:59,307 --> 00:39:01,718 so this is not really been computer vision so far, 1108 00:39:01,718 --> 00:39:03,878 and we need to pull this back to computer vision 1109 00:39:03,878 --> 00:39:05,528 since this is a vision class. 1110 00:39:05,528 --> 00:39:07,107 We've alluded many times to this 1111 00:39:07,107 --> 00:39:08,747 image captioning model 1112 00:39:08,747 --> 00:39:10,528 where we want to build models that can input 1113 00:39:10,528 --> 00:39:14,376 an image and then output a caption in natural language. 1114 00:39:14,376 --> 00:39:17,408 There were a bunch of papers a couple years ago 1115 00:39:17,408 --> 00:39:19,787 that all had relatively similar approaches. 1116 00:39:19,787 --> 00:39:22,776 But I'm showing the figure from the paper from our lab 1117 00:39:22,776 --> 00:39:25,026 in a totally un-biased way. 1118 00:39:26,876 --> 00:39:29,198 But, the idea here is that the caption 1119 00:39:29,198 --> 00:39:31,648 is this variably length sequence that we might, 1120 00:39:31,648 --> 00:39:34,198 the sequence might have different numbers 1121 00:39:34,198 --> 00:39:35,608 of words for different captions. 1122 00:39:35,608 --> 00:39:37,059 So this is a totally natural fit 1123 00:39:37,059 --> 00:39:39,947 for a recurrent neural network language model. 1124 00:39:39,947 --> 00:39:41,598 So then what this model looks like 1125 00:39:41,598 --> 00:39:43,878 is we have some convolutional network 1126 00:39:43,878 --> 00:39:44,958 which will input the, 1127 00:39:44,958 --> 00:39:46,786 which will take as input the image, 1128 00:39:46,786 --> 00:39:48,397 and we've seen a lot about how 1129 00:39:48,397 --> 00:39:50,379 convolution networks work at this point, 1130 00:39:50,379 --> 00:39:52,753 and that convolutional network will produce 1131 00:39:52,753 --> 00:39:54,569 a summary vector of the image 1132 00:39:54,569 --> 00:39:56,227 which will then feed into the first time step 1133 00:39:56,227 --> 00:39:58,928 of one of these recurrent neural network language models 1134 00:39:58,928 --> 00:40:02,888 which will then produce words of the caption one at a time. 1135 00:40:02,888 --> 00:40:04,945 So the way that this kind of works at test time 1136 00:40:04,945 --> 00:40:06,425 after the model is trained 1137 00:40:06,425 --> 00:40:07,847 looks almost exactly the same 1138 00:40:07,847 --> 00:40:09,665 as these character level language models 1139 00:40:09,665 --> 00:40:11,178 that we saw a little bit ago. 
1140 00:40:11,178 --> 00:40:12,699 We'll take our input image, 1141 00:40:12,699 --> 00:40:14,499 feed it through our convolutional network. 1142 00:40:14,499 --> 00:40:16,987 But now instead of taking the softmax scores 1143 00:40:16,987 --> 00:40:18,638 from an image net model, 1144 00:40:18,638 --> 00:40:22,997 we'll instead take this 4,096 dimensional vector 1145 00:40:22,997 --> 00:40:24,915 from the end of the model, 1146 00:40:24,915 --> 00:40:27,998 and we'll take that vector and use it 1147 00:40:29,038 --> 00:40:31,377 to summarize the whole content of the image. 1148 00:40:31,377 --> 00:40:34,208 Now, remember when we talked about RNN language models, 1149 00:40:34,208 --> 00:40:36,379 we said that we need to seed the language model 1150 00:40:36,379 --> 00:40:37,947 with that first initial input 1151 00:40:37,947 --> 00:40:39,528 to tell it to start generating text. 1152 00:40:39,528 --> 00:40:42,118 So in this case, we'll give it some special start token, 1153 00:40:42,118 --> 00:40:45,236 which is just saying, hey, this is the start of a sentence. 1154 00:40:45,236 --> 00:40:46,728 Please start generating some text 1155 00:40:46,728 --> 00:40:49,006 conditioned on this image information. 1156 00:40:49,006 --> 00:40:52,338 So now previously, we saw that in this RNN language model, 1157 00:40:52,338 --> 00:40:55,328 we had these matrices that were taking the previous, 1158 00:40:55,328 --> 00:40:57,107 the input at the current time step 1159 00:40:57,107 --> 00:40:58,558 and the hidden state of the previous time step 1160 00:40:58,558 --> 00:41:01,240 and combining those to get the next hidden state. 1161 00:41:01,240 --> 00:41:04,678 Well now, we also need to add in this image information. 1162 00:41:04,678 --> 00:41:06,667 So one way, people play around with exactly 1163 00:41:06,667 --> 00:41:09,288 different ways to incorporate this image information, 1164 00:41:09,288 --> 00:41:10,688 but one simple way is just to add 1165 00:41:10,688 --> 00:41:13,105 a third weight matrix that is 1166 00:41:14,561 --> 00:41:16,779 adding in this image information at every time step 1167 00:41:16,779 --> 00:41:18,598 to compute the next hidden state. 1168 00:41:18,598 --> 00:41:21,145 So now, we'll compute this distribution 1169 00:41:21,145 --> 00:41:23,635 over all scores in our vocabulary 1170 00:41:23,635 --> 00:41:25,056 and here, our vocabulary is something 1171 00:41:25,056 --> 00:41:25,907 like all English words, 1172 00:41:25,907 --> 00:41:27,307 so it could be pretty large. 1173 00:41:27,307 --> 00:41:28,699 We'll sample from that distribution 1174 00:41:28,699 --> 00:41:32,099 and now pass that word back as input at the next time step. 1175 00:41:32,099 --> 00:41:34,488 And that will then feed that word in, 1176 00:41:34,488 --> 00:41:37,418 again get a distribution over all words in the vocab, 1177 00:41:37,418 --> 00:41:39,546 and again sample to produce the next word. 1178 00:41:39,546 --> 00:41:42,248 So then, after that thing is all done, 1179 00:41:42,248 --> 00:41:44,477 we'll maybe generate, 1180 00:41:44,477 --> 00:41:47,538 we'll generate this complete sentence. 1181 00:41:47,538 --> 00:41:50,956 We stop generation once we sample the special end token, 1182 00:41:50,956 --> 00:41:52,887 which kind of corresponds to the period 1183 00:41:52,887 --> 00:41:54,347 at the end of the sentence.
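A rough sketch of the image-conditioned recurrence just described, where a third weight matrix injects the CNN's image feature at every time step (the names, sizes, and the START and END token indices here are illustrative assumptions, not the exact formulation from the paper):

    import numpy as np

    vocab_size, hidden_size, feat_size = 1000, 512, 4096
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # previous word -> hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # previous hidden -> hidden
    Wih = np.random.randn(hidden_size, feat_size) * 0.01    # image feature -> hidden (the third matrix)
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> word scores
    START, END = 0, 1                                        # assumed special token indices

    def caption(v, max_len=20):
        # v is the 4,096-dimensional image feature from the CNN, shape (feat_size, 1).
        h = np.zeros((hidden_size, 1))
        word, out = START, []
        for _ in range(max_len):
            x = np.zeros((vocab_size, 1))
            x[word] = 1                                      # one-hot previous word
            h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)         # image information added every step
            scores = Why @ h
            p = np.exp(scores) / np.sum(np.exp(scores))      # distribution over the vocabulary
            word = int(np.random.choice(vocab_size, p=p.ravel()))
            if word == END:                                  # stop once the end token is sampled
                break
            out.append(word)
        return out

    print(caption(np.random.randn(feat_size, 1)))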
1184 00:41:54,347 --> 00:41:56,419 Then once the network samples this ends token, 1185 00:41:56,419 --> 00:41:57,859 we stop generation and we're done 1186 00:41:57,859 --> 00:42:00,328 and we've gotten our caption for this image. 1187 00:42:00,328 --> 00:42:01,798 And now, during training, 1188 00:42:01,798 --> 00:42:03,419 we trained this thing to generate, 1189 00:42:03,419 --> 00:42:04,939 like we put an end token at the end 1190 00:42:04,939 --> 00:42:06,739 of every caption during training 1191 00:42:06,739 --> 00:42:08,408 so that the network kind of learned during training 1192 00:42:08,408 --> 00:42:10,907 that end tokens come at the end of sequences. 1193 00:42:10,907 --> 00:42:12,958 So then, during test time, 1194 00:42:12,958 --> 00:42:14,739 it tends to sample these end tokens 1195 00:42:14,739 --> 00:42:16,848 once it's done generating. 1196 00:42:16,848 --> 00:42:18,457 So we trained this model 1197 00:42:18,457 --> 00:42:20,139 in kind of a completely supervised way. 1198 00:42:20,139 --> 00:42:21,547 You can find data sets 1199 00:42:21,547 --> 00:42:24,763 that have images together with natural language captions. 1200 00:42:24,763 --> 00:42:26,805 Microsoft COCO is probably the biggest 1201 00:42:26,805 --> 00:42:28,666 and most widely used for this task. 1202 00:42:28,666 --> 00:42:30,191 But you can just train this model 1203 00:42:30,191 --> 00:42:31,883 in a purely supervised way. 1204 00:42:31,883 --> 00:42:34,935 And then backpropagate through to jointly train 1205 00:42:34,935 --> 00:42:37,653 both this recurrent neural network language model 1206 00:42:37,653 --> 00:42:38,635 and then also pass gradients 1207 00:42:38,635 --> 00:42:40,846 back into this final layer of this the CNN 1208 00:42:40,846 --> 00:42:42,396 and additionally update the weights 1209 00:42:42,396 --> 00:42:45,580 of the CNN to jointly tune all parts of the model 1210 00:42:45,580 --> 00:42:47,837 to perform this task. 1211 00:42:47,837 --> 00:42:49,528 Once you train these models, 1212 00:42:49,528 --> 00:42:52,438 they actually do some pretty reasonable things. 1213 00:42:52,438 --> 00:42:54,997 These are some real results from a model, 1214 00:42:54,997 --> 00:42:56,689 from one of these trained models, 1215 00:42:56,689 --> 00:42:58,226 and it says things like a cat sitting 1216 00:42:58,226 --> 00:43:00,995 on a suitcase on the floor, 1217 00:43:00,995 --> 00:43:02,583 which is pretty impressive. 1218 00:43:02,583 --> 00:43:04,975 It knows about cats sitting on a tree branch, 1219 00:43:04,975 --> 00:43:06,499 which is also pretty cool. 1220 00:43:06,499 --> 00:43:07,796 It knows about two people walking 1221 00:43:07,796 --> 00:43:09,631 on the beach with surfboards. 1222 00:43:09,631 --> 00:43:11,755 So these models are actually pretty powerful 1223 00:43:11,755 --> 00:43:14,642 and can produce relatively complex captions 1224 00:43:14,642 --> 00:43:16,635 to describe the image. 1225 00:43:16,635 --> 00:43:17,630 But that being said, 1226 00:43:17,630 --> 00:43:19,808 these models are really not perfect. 1227 00:43:19,808 --> 00:43:21,227 They're not magical. 1228 00:43:21,227 --> 00:43:24,319 Just like any machine learning model, 1229 00:43:24,319 --> 00:43:25,629 if you try to run them on data 1230 00:43:25,629 --> 00:43:27,458 that was very different from the training data, 1231 00:43:27,458 --> 00:43:28,830 they don't work very well. 1232 00:43:28,830 --> 00:43:30,364 So for example, this example, 1233 00:43:30,364 --> 00:43:33,317 it says a woman is holding a cat in her hand. 
1234 00:43:33,317 --> 00:43:35,421 There's clearly no cat in the image. 1235 00:43:35,421 --> 00:43:36,863 But she is wearing a fur coat, 1236 00:43:36,863 --> 00:43:38,106 and maybe the texture of that coat 1237 00:43:38,106 --> 00:43:41,359 kind of looked like a cat to the model. 1238 00:43:41,359 --> 00:43:42,903 Over here, we see a woman standing on a beach 1239 00:43:42,903 --> 00:43:44,379 holding a surfboard. 1240 00:43:44,379 --> 00:43:46,744 Well, she's definitely not holding a surfboard 1241 00:43:46,744 --> 00:43:47,892 and she's doing a handstand, 1242 00:43:47,892 --> 00:43:49,909 which is maybe the interesting part of that image, 1243 00:43:49,909 --> 00:43:51,820 and the model totally missed that. 1244 00:43:51,820 --> 00:43:53,623 Also, over here, we see this example 1245 00:43:53,623 --> 00:43:56,137 where there's this picture of a spider web 1246 00:43:56,137 --> 00:43:57,168 in the tree branch, 1247 00:43:57,168 --> 00:43:58,238 and it totally, 1248 00:43:58,238 --> 00:43:59,071 and it says something like 1249 00:43:59,071 --> 00:44:00,382 a bird sitting on a tree branch. 1250 00:44:00,382 --> 00:44:01,802 So it totally missed the spider, 1251 00:44:01,802 --> 00:44:03,144 but during training, 1252 00:44:03,144 --> 00:44:05,139 it never really saw examples of spiders. 1253 00:44:05,139 --> 00:44:06,722 It just knows that birds sit 1254 00:44:06,722 --> 00:44:08,217 on tree branches during training. 1255 00:44:08,217 --> 00:44:10,152 So it kind of makes these reasonable mistakes. 1256 00:44:10,152 --> 00:44:10,985 Or here at the bottom, 1257 00:44:10,985 --> 00:44:12,155 it can't really tell the difference 1258 00:44:12,155 --> 00:44:14,338 between this guy throwing and catching the ball, 1259 00:44:14,338 --> 00:44:15,712 but it does know that it's a baseball player 1260 00:44:15,712 --> 00:44:17,709 and there's balls and things involved. 1261 00:44:17,709 --> 00:44:19,347 So again, just want to say that these models 1262 00:44:19,347 --> 00:44:20,441 are not perfect. 1263 00:44:20,441 --> 00:44:23,313 They work pretty well when you ask them to caption images 1264 00:44:23,313 --> 00:44:25,109 that were similar to the training data, 1265 00:44:25,109 --> 00:44:26,711 but they definitely have a hard time 1266 00:44:26,711 --> 00:44:29,188 generalizing far beyond that. 1267 00:44:29,188 --> 00:44:30,560 So another thing you'll sometimes see 1268 00:44:30,560 --> 00:44:34,152 is this slightly more advanced model called Attention, 1269 00:44:34,152 --> 00:44:36,447 where now when we're generating the words of this caption, 1270 00:44:36,447 --> 00:44:39,120 we can allow the model to steer it's attention 1271 00:44:39,120 --> 00:44:41,036 to different parts of the image. 1272 00:44:41,036 --> 00:44:44,670 And I don't want to spend too much time on this. 1273 00:44:44,670 --> 00:44:46,859 But the general way that this works is that 1274 00:44:46,859 --> 00:44:48,925 now our convolutional network, 1275 00:44:48,925 --> 00:44:50,664 rather than producing a single vector 1276 00:44:50,664 --> 00:44:52,227 summarizing the entire image, 1277 00:44:52,227 --> 00:44:54,274 now it produces some grid of vectors 1278 00:44:54,274 --> 00:44:55,683 that summarize the, 1279 00:44:55,683 --> 00:44:58,503 that give maybe one vector for each spatial location 1280 00:44:58,503 --> 00:44:59,954 in the image. 
1281 00:44:59,954 --> 00:45:00,787 And now, when we, 1282 00:45:00,787 --> 00:45:03,304 when this model runs forward, 1283 00:45:03,304 --> 00:45:06,618 in addition to sampling the vocabulary at every time step, 1284 00:45:06,618 --> 00:45:08,523 it also produces a distribution 1285 00:45:08,523 --> 00:45:09,964 over the locations in the image 1286 00:45:09,964 --> 00:45:11,263 where it wants to look. 1287 00:45:11,263 --> 00:45:13,581 And now this distribution over image locations 1288 00:45:13,581 --> 00:45:15,561 can be seen as a kind of a tension 1289 00:45:15,561 --> 00:45:18,276 of where the model should look during training. 1290 00:45:18,276 --> 00:45:19,397 So now that first hidden state 1291 00:45:19,397 --> 00:45:23,155 computes this distribution over image locations, 1292 00:45:23,155 --> 00:45:26,316 which then goes back to the set of vectors 1293 00:45:26,316 --> 00:45:27,832 to give a single summary vector 1294 00:45:27,832 --> 00:45:31,372 that maybe focuses the attention on one part of that image. 1295 00:45:31,372 --> 00:45:32,795 And now that summary vector gets fed, 1296 00:45:32,795 --> 00:45:34,422 as an additional input, 1297 00:45:34,422 --> 00:45:36,508 at the next time step of the neural network. 1298 00:45:36,508 --> 00:45:38,278 And now again, it will produce two outputs. 1299 00:45:38,278 --> 00:45:40,414 One is our distribution over vocabulary words. 1300 00:45:40,414 --> 00:45:43,714 And the other is a distribution over image locations. 1301 00:45:43,714 --> 00:45:45,402 This whole process will continue, 1302 00:45:45,402 --> 00:45:47,577 and it will sort of do these two different things 1303 00:45:47,577 --> 00:45:49,160 at every time step. 1304 00:45:50,296 --> 00:45:51,390 And after you train the model, 1305 00:45:51,390 --> 00:45:53,905 then you can see that it kind of 1306 00:45:53,905 --> 00:45:56,104 will shift it's attention around the image 1307 00:45:56,104 --> 00:45:58,298 for every word that it generates in the caption. 1308 00:45:58,298 --> 00:45:59,827 Here you can see that it 1309 00:45:59,827 --> 00:46:02,723 produced the caption, a bird is flying over, 1310 00:46:02,723 --> 00:46:04,436 I can't see that far. 1311 00:46:04,436 --> 00:46:06,474 But you can see that its attention 1312 00:46:06,474 --> 00:46:08,556 is shifting around different parts of the image 1313 00:46:08,556 --> 00:46:12,473 for each word in the caption that it generates. 1314 00:46:14,112 --> 00:46:15,417 There's this notion of hard attention 1315 00:46:15,417 --> 00:46:16,544 versus soft attention, 1316 00:46:16,544 --> 00:46:19,147 which I don't really want to get into too much, 1317 00:46:19,147 --> 00:46:21,434 but with this idea of soft attention, 1318 00:46:21,434 --> 00:46:24,405 we're kind of taking a weighted combination 1319 00:46:24,405 --> 00:46:26,614 of all features from all image locations, 1320 00:46:26,614 --> 00:46:28,275 whereas in the hard attention case, 1321 00:46:28,275 --> 00:46:31,004 we're forcing the model to select exactly one location 1322 00:46:31,004 --> 00:46:34,259 to look at in the image at each time step. 
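A minimal sketch of the soft attention step just described: the hidden state scores each spatial location, a softmax turns those scores into a distribution over locations, and the context vector is a weighted sum of the features (the grid size and the simple dot-product scoring are illustrative choices, not the exact formulation from any particular paper):

    import numpy as np

    L, D, H = 49, 512, 512                      # 7x7 grid of D-dim features, hidden size H
    features = np.random.randn(L, D)            # one feature vector per spatial location
    h = np.random.randn(H)                      # current hidden state of the RNN
    W_att = np.random.randn(D, H) * 0.01        # simple linear scoring function

    scores = features @ (W_att @ h)             # one score per location, shape (L,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                             # attention weights: a distribution over locations
    context = a @ features                      # soft attention: weighted sum of all features

    # Hard attention would instead pick a single location, e.g. np.random.choice(L, p=a),
    # which is not differentiable, hence the fancier training mentioned above.
    print(context.shape, round(a.sum(), 3))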
1323 00:46:34,259 --> 00:46:35,123 So the hard attention case 1324 00:46:35,123 --> 00:46:36,946 where we're selecting exactly one image location 1325 00:46:36,946 --> 00:46:38,277 is a little bit tricky 1326 00:46:38,277 --> 00:46:41,270 because that is not really a differentiable function, 1327 00:46:41,270 --> 00:46:42,774 so you need to do something slightly fancier 1328 00:46:42,774 --> 00:46:44,292 than vanilla backpropagation 1329 00:46:44,292 --> 00:46:46,247 in order to just train the model in that scenario. 1330 00:46:46,247 --> 00:46:48,145 And I think we'll talk about that a little bit later 1331 00:46:48,145 --> 00:46:51,498 in the lecture on reinforcement learning. 1332 00:46:51,498 --> 00:46:53,577 Now, when you look at after you train 1333 00:46:53,577 --> 00:46:54,750 one of these attention models 1334 00:46:54,750 --> 00:46:57,415 and then run it on to generate captions, 1335 00:46:57,415 --> 00:46:59,326 you can see that it tends to focus it's attention 1336 00:46:59,326 --> 00:47:01,980 on maybe the salient or semanticly meaningful part 1337 00:47:01,980 --> 00:47:04,067 of the image when generating captions. 1338 00:47:04,067 --> 00:47:06,419 You can see that the caption was a woman 1339 00:47:06,419 --> 00:47:08,413 is throwing a frisbee in a park 1340 00:47:08,413 --> 00:47:10,257 and you can see that this attention mask, 1341 00:47:10,257 --> 00:47:11,637 when it generated the word, 1342 00:47:11,637 --> 00:47:13,648 when the model generated the word frisbee, 1343 00:47:13,648 --> 00:47:14,516 at the same time, 1344 00:47:14,516 --> 00:47:17,403 it was focusing it's attention on this image region 1345 00:47:17,403 --> 00:47:18,976 that actually contains the frisbee. 1346 00:47:18,976 --> 00:47:20,644 This is actually really cool. 1347 00:47:20,644 --> 00:47:23,410 We did not tell the model where it should be looking 1348 00:47:23,410 --> 00:47:24,542 at every time step. 1349 00:47:24,542 --> 00:47:26,552 It sort of figured all that out for itself 1350 00:47:26,552 --> 00:47:27,955 during the training process. 1351 00:47:27,955 --> 00:47:29,716 Because somehow, it figured out that looking 1352 00:47:29,716 --> 00:47:31,540 at that image region was the right thing to do 1353 00:47:31,540 --> 00:47:32,721 for this image. 1354 00:47:32,721 --> 00:47:34,610 And because everything in this model is differentiable, 1355 00:47:34,610 --> 00:47:35,998 because we can backpropagate 1356 00:47:35,998 --> 00:47:38,161 through all these soft attention steps, 1357 00:47:38,161 --> 00:47:39,592 all of this soft attention stuff 1358 00:47:39,592 --> 00:47:42,541 just comes out through the training process. 1359 00:47:42,541 --> 00:47:45,297 So that's really, really cool. 1360 00:47:45,297 --> 00:47:47,345 By the way, this idea of recurrent neural networks 1361 00:47:47,345 --> 00:47:49,975 and attention actually gets used in other tasks 1362 00:47:49,975 --> 00:47:51,565 beyond image captioning. 1363 00:47:51,565 --> 00:47:53,169 One recent example is this idea 1364 00:47:53,169 --> 00:47:54,691 of visual question answering. 1365 00:47:54,691 --> 00:47:58,195 So here, our model is going to take two things as input. 1366 00:47:58,195 --> 00:47:59,259 It's going to take an image 1367 00:47:59,259 --> 00:48:01,608 and it will also take a natural language question 1368 00:48:01,608 --> 00:48:04,010 that's asking some question about the image. 
1369 00:48:04,010 --> 00:48:06,122 Here, we might see this image on the left 1370 00:48:06,122 --> 00:48:07,536 and we might ask the question, 1371 00:48:07,536 --> 00:48:10,163 what endangered animal is featured on the truck? 1372 00:48:10,163 --> 00:48:11,915 And now the model needs to select from one 1373 00:48:11,915 --> 00:48:13,835 of these four natural language answers 1374 00:48:13,835 --> 00:48:17,327 about which of these answers correctly answers that question 1375 00:48:17,327 --> 00:48:19,027 in the context of the image. 1376 00:48:19,027 --> 00:48:21,586 So you can imagine kind of stitching this model together 1377 00:48:21,586 --> 00:48:24,774 using CNNs and RNNs in kind of a natural way. 1378 00:48:24,774 --> 00:48:26,274 Now, we're in this 1379 00:48:28,125 --> 00:48:29,701 many to one scenario, 1380 00:48:29,701 --> 00:48:31,978 where now our model needs to take as input 1381 00:48:31,978 --> 00:48:33,566 this natural language sequence, 1382 00:48:33,566 --> 00:48:35,552 so we can imagine running a recurrent neural network 1383 00:48:35,552 --> 00:48:37,485 over each element of that input question, 1384 00:48:37,485 --> 00:48:40,111 to now summarize the input question in a single vector. 1385 00:48:40,111 --> 00:48:43,928 And then we can have a CNN to again summarize the image, 1386 00:48:43,928 --> 00:48:46,012 and now combine both the vector from the CNN 1387 00:48:46,012 --> 00:48:49,531 and the vector from the question and coding RNN 1388 00:48:49,531 --> 00:48:51,645 to then predict a distribution over answers. 1389 00:48:51,645 --> 00:48:52,786 We also sometimes, 1390 00:48:52,786 --> 00:48:54,416 you'll also sometimes see this idea 1391 00:48:54,416 --> 00:48:56,275 of soft spacial attention being incorporated 1392 00:48:56,275 --> 00:48:59,175 into things like visual question answering. 1393 00:48:59,175 --> 00:49:00,394 So you can see that here, 1394 00:49:00,394 --> 00:49:03,956 this model is also having the spatial attention 1395 00:49:03,956 --> 00:49:05,342 over the image when it's trying 1396 00:49:05,342 --> 00:49:07,062 to determine answers to the questions. 1397 00:49:07,062 --> 00:49:09,062 Just to, yeah, question? 1398 00:49:18,715 --> 00:49:19,548 So the question is 1399 00:49:19,548 --> 00:49:20,878 How are the different inputs combined? 1400 00:49:20,878 --> 00:49:23,116 Do you mean like the encoded question vector 1401 00:49:23,116 --> 00:49:25,998 and the encoded image vector? 1402 00:49:25,998 --> 00:49:27,188 Yeah, so the question is 1403 00:49:27,188 --> 00:49:28,445 how are the encoded image 1404 00:49:28,445 --> 00:49:30,395 and the encoded question vector combined? 1405 00:49:30,395 --> 00:49:31,756 Kind of the simplest thing to do 1406 00:49:31,756 --> 00:49:32,885 is just to concatenate them 1407 00:49:32,885 --> 00:49:34,847 and stick them into fully connected layers. 1408 00:49:34,847 --> 00:49:35,883 That's probably the most common 1409 00:49:35,883 --> 00:49:37,864 and that's probably the first thing to try. 1410 00:49:37,864 --> 00:49:39,396 Sometimes people do slightly fancier things 1411 00:49:39,396 --> 00:49:41,673 where they might try to have multiplicative interactions 1412 00:49:41,673 --> 00:49:42,885 between those two vectors 1413 00:49:42,885 --> 00:49:44,389 to allow a more powerful function. 1414 00:49:44,389 --> 00:49:46,467 But generally, concatenation is kind of a good 1415 00:49:46,467 --> 00:49:48,050 first thing to try. 
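A small sketch of that simplest option for visual question answering: concatenate the CNN's image vector with the RNN's question vector, then push the result through fully connected layers to score the candidate answers (sizes and layer widths are made up for illustration):

    import numpy as np

    img_feat = np.random.randn(4096)                 # image summary from the CNN
    q_feat = np.random.randn(512)                    # question summary from the encoding RNN
    num_answers = 4

    x = np.concatenate([img_feat, q_feat])           # the simple "just concatenate" combination
    W1 = np.random.randn(1024, x.shape[0]) * 0.01
    W2 = np.random.randn(num_answers, 1024) * 0.01
    hidden = np.maximum(0, W1 @ x)                   # fully connected layer + ReLU
    scores = W2 @ hidden                             # one score per candidate answer
    probs = np.exp(scores - scores.max())
    probs = probs / probs.sum()                      # distribution over the answers
    print(probs)

    # The fancier alternative mentioned above swaps the concatenation for
    # multiplicative interactions, e.g. an element-wise product of projected vectors.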
1416 00:49:49,426 --> 00:49:51,727 Okay, so now we've talked about a bunch of scenarios 1417 00:49:51,727 --> 00:49:54,083 where RNNs are used for different kinds of problems. 1418 00:49:54,083 --> 00:49:55,084 And I think it's super cool 1419 00:49:55,084 --> 00:49:56,758 because it allows you to 1420 00:49:56,758 --> 00:49:58,855 start tackling really complicated problems 1421 00:49:58,855 --> 00:50:02,386 combining images and computer vision 1422 00:50:02,386 --> 00:50:04,072 with natural language processing. 1423 00:50:04,072 --> 00:50:05,794 And you can see that we can kind of stith together 1424 00:50:05,794 --> 00:50:06,776 these models like Lego blocks 1425 00:50:06,776 --> 00:50:08,870 and attack really complicated things, 1426 00:50:08,870 --> 00:50:11,369 Like image captioning or visual question answering 1427 00:50:11,369 --> 00:50:14,069 just by stitching together these relatively simple 1428 00:50:14,069 --> 00:50:16,736 types of neural network modules. 1429 00:50:18,708 --> 00:50:20,861 But I'd also like to mention that 1430 00:50:20,861 --> 00:50:22,359 so far, we've talked about this idea 1431 00:50:22,359 --> 00:50:24,060 of a single recurrent network layer, 1432 00:50:24,060 --> 00:50:26,274 where we have sort of one hidden state, 1433 00:50:26,274 --> 00:50:29,404 and another thing that you'll see pretty commonly 1434 00:50:29,404 --> 00:50:32,581 is this idea of a multilayer recurrent neural network. 1435 00:50:32,581 --> 00:50:36,577 Here, this is a three layer recurrent neural network, 1436 00:50:36,577 --> 00:50:38,535 so now our input goes in, 1437 00:50:38,535 --> 00:50:42,292 goes into, goes in and produces a sequence of hidden states 1438 00:50:42,292 --> 00:50:44,554 from the first recurrent neural network layer. 1439 00:50:44,554 --> 00:50:46,309 And now, after we run kind of 1440 00:50:46,309 --> 00:50:47,893 one recurrent neural network layer, 1441 00:50:47,893 --> 00:50:50,468 then we have this whole sequence of hidden states. 1442 00:50:50,468 --> 00:50:52,991 And now, we can use the sequence of hidden states 1443 00:50:52,991 --> 00:50:54,733 as an input sequence to another 1444 00:50:54,733 --> 00:50:56,474 recurrent neural network layer. 1445 00:50:56,474 --> 00:50:57,801 And then you can just imagine, 1446 00:50:57,801 --> 00:50:59,757 which will then produce another sequence of hidden states 1447 00:50:59,757 --> 00:51:01,577 from the second RNN layer. 1448 00:51:01,577 --> 00:51:02,410 And then you can just imagine 1449 00:51:02,410 --> 00:51:04,181 stacking these things on top of each other, 1450 00:51:04,181 --> 00:51:06,108 cause we know that we've seen in other contexts 1451 00:51:06,108 --> 00:51:08,253 that deeper models tend to perform better 1452 00:51:08,253 --> 00:51:09,398 for various problems. 1453 00:51:09,398 --> 00:51:11,851 And the same kind of holds in RNNs as well. 1454 00:51:11,851 --> 00:51:14,377 For many problems, you'll see maybe a two 1455 00:51:14,377 --> 00:51:16,714 or three layer recurrent neural network model 1456 00:51:16,714 --> 00:51:18,215 is pretty commonly used. 1457 00:51:18,215 --> 00:51:22,086 You typically don't see super deep models in RNNs. 1458 00:51:22,086 --> 00:51:25,449 So generally, like two, three, four layer RNNs 1459 00:51:25,449 --> 00:51:29,327 is maybe as deep as you'll typically go. 
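A minimal sketch of the stacking idea: the sequence of hidden states produced by one recurrent layer becomes the input sequence to the next layer (three layers here, with made-up sizes):

    import numpy as np

    T, D, H, num_layers = 10, 64, 128, 3
    xs = [np.random.randn(D, 1) for _ in range(T)]             # the input sequence

    def make_layer(in_dim, hid_dim):
        return {"Wxh": np.random.randn(hid_dim, in_dim) * 0.01,
                "Whh": np.random.randn(hid_dim, hid_dim) * 0.01,
                "b": np.zeros((hid_dim, 1))}

    layers = [make_layer(D, H)] + [make_layer(H, H) for _ in range(num_layers - 1)]

    seq = xs
    for layer in layers:                                        # one recurrent layer at a time
        h = np.zeros((H, 1))
        out = []
        for x in seq:                                           # step through time
            h = np.tanh(layer["Wxh"] @ x + layer["Whh"] @ h + layer["b"])
            out.append(h)
        seq = out                                               # hidden states feed the next layer
    print(len(seq), seq[0].shape)                               # T hidden states from the top layer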
1460 00:51:29,327 --> 00:51:32,231 Then, I think it's also really interesting and important 1461 00:51:32,231 --> 00:51:33,500 to think about, 1462 00:51:33,500 --> 00:51:34,714 now we've seen kind of 1463 00:51:34,714 --> 00:51:38,222 what kinds of problems these RNNs can be used for, 1464 00:51:38,222 --> 00:51:40,079 but then you need to think a little bit more carefully 1465 00:51:40,079 --> 00:51:41,795 about exactly what happens to these models 1466 00:51:41,795 --> 00:51:43,229 when we try to train them. 1467 00:51:43,229 --> 00:51:45,704 So here, I've drawn this little vanilla RNN cell 1468 00:51:45,704 --> 00:51:47,447 that we've talked about so far. 1469 00:51:47,447 --> 00:51:50,639 So here, we're taking our current input, x t, 1470 00:51:50,639 --> 00:51:52,971 and our previous hidden state, h t minus one, 1471 00:51:52,971 --> 00:51:55,307 and then we stack, those are two vectors. 1472 00:51:55,307 --> 00:51:57,359 So we can just stack them together. 1473 00:51:57,359 --> 00:51:59,033 And then perform this matrix multiplication 1474 00:51:59,033 --> 00:52:00,431 with our weight matrix, 1475 00:52:00,431 --> 00:52:01,911 to give our, 1476 00:52:01,911 --> 00:52:03,798 and then squash that output through a tanh, 1477 00:52:03,798 --> 00:52:05,783 and that will give us our next hidden state. 1478 00:52:05,783 --> 00:52:07,687 And that's kind of the basic functional form 1479 00:52:07,687 --> 00:52:10,131 of this vanilla recurrent neural network. 1480 00:52:10,131 --> 00:52:11,347 But then, we need to think about 1481 00:52:11,347 --> 00:52:14,080 what happens in this architecture 1482 00:52:14,080 --> 00:52:17,738 during the backward pass when we try to compute gradients? 1483 00:52:17,738 --> 00:52:19,435 So then if we think about trying to compute, 1484 00:52:19,435 --> 00:52:21,426 so then during the backwards pass, 1485 00:52:21,426 --> 00:52:24,759 we'll receive the derivative of our h t, 1486 00:52:25,850 --> 00:52:27,075 we'll receive derivative of loss 1487 00:52:27,075 --> 00:52:28,568 with respect to h t. 1488 00:52:28,568 --> 00:52:30,885 And during the backward pass through the cell, 1489 00:52:30,885 --> 00:52:32,720 we'll need to compute derivative of loss 1490 00:52:32,720 --> 00:52:34,632 to the respect of h t minus one. 1491 00:52:34,632 --> 00:52:37,026 Then, when we compute this backward pass, 1492 00:52:37,026 --> 00:52:38,601 we see that the gradient flows backward 1493 00:52:38,601 --> 00:52:39,881 through this red path. 1494 00:52:39,881 --> 00:52:41,706 So first, that gradient will flow backwards 1495 00:52:41,706 --> 00:52:43,068 through this tanh gate, 1496 00:52:43,068 --> 00:52:44,036 and then it will flow backwards 1497 00:52:44,036 --> 00:52:45,958 through this matrix multiplication gate. 1498 00:52:45,958 --> 00:52:48,644 And then, as we've seen in the homework 1499 00:52:48,644 --> 00:52:52,054 and when implementing these matrix multiplication layers, 1500 00:52:52,054 --> 00:52:53,252 when you backpropagate through 1501 00:52:53,252 --> 00:52:54,711 this matrix multiplication gate, 1502 00:52:54,711 --> 00:52:56,785 you end up mulitplying by the transpose 1503 00:52:56,785 --> 00:52:58,457 of that weight matrix. 1504 00:52:58,457 --> 00:53:00,642 So that means that every time we backpropagate 1505 00:53:00,642 --> 00:53:03,857 through one of these vanilla RNN cells, 1506 00:53:03,857 --> 00:53:08,024 we end up multiplying by some part of the weight matrix. 
1507 00:53:09,110 --> 00:53:11,124 So now if you imagine that we are sticking many 1508 00:53:11,124 --> 00:53:13,834 of these recurrent neural network cells in sequence, 1509 00:53:13,834 --> 00:53:15,413 because again this is an RNN. 1510 00:53:15,413 --> 00:53:16,599 We want a model sequences. 1511 00:53:16,599 --> 00:53:18,976 Now if you imagine what happens to the gradient flow 1512 00:53:18,976 --> 00:53:20,921 through a sequence of these layers, 1513 00:53:20,921 --> 00:53:23,770 then something kind of fishy starts to happen. 1514 00:53:23,770 --> 00:53:25,207 Because now, when we want to compute 1515 00:53:25,207 --> 00:53:27,863 the gradient of the loss with respect to h zero, 1516 00:53:27,863 --> 00:53:29,440 we need to backpropagate through every one 1517 00:53:29,440 --> 00:53:30,810 of these RNN cells. 1518 00:53:30,810 --> 00:53:32,903 And every time you backpropagate through one cell, 1519 00:53:32,903 --> 00:53:35,641 you'll pick up one of these w transpose factors. 1520 00:53:35,641 --> 00:53:38,176 So that means that the final expression 1521 00:53:38,176 --> 00:53:40,289 for the gradient on h zero 1522 00:53:40,289 --> 00:53:42,161 will involve many, many factors 1523 00:53:42,161 --> 00:53:43,550 of this weight matrix, 1524 00:53:43,550 --> 00:53:44,967 which could be kind of bad. 1525 00:53:44,967 --> 00:53:46,976 Maybe don't think about the weight, 1526 00:53:46,976 --> 00:53:48,329 the matrix case, 1527 00:53:48,329 --> 00:53:50,138 but imagine a scaler case. 1528 00:53:50,138 --> 00:53:51,877 If we end up, if we have some scaler 1529 00:53:51,877 --> 00:53:53,662 and we multiply by that same number over and over 1530 00:53:53,662 --> 00:53:54,563 and over again, 1531 00:53:54,563 --> 00:53:56,038 maybe not for four examples, 1532 00:53:56,038 --> 00:53:57,133 but for something like a hundred 1533 00:53:57,133 --> 00:54:00,269 or several hundred time steps, 1534 00:54:00,269 --> 00:54:01,832 then multiplying by the same number 1535 00:54:01,832 --> 00:54:03,835 over and over again is really bad. 1536 00:54:03,835 --> 00:54:05,244 In the scaler case, 1537 00:54:05,244 --> 00:54:06,756 it's either going to explode 1538 00:54:06,756 --> 00:54:08,971 in the case that that number is greater than one 1539 00:54:08,971 --> 00:54:10,504 or it's going to vanish towards zero 1540 00:54:10,504 --> 00:54:12,912 in the case that number is less than one 1541 00:54:12,912 --> 00:54:14,069 in absolute value. 1542 00:54:14,069 --> 00:54:16,355 And the only way in which this will not happen 1543 00:54:16,355 --> 00:54:18,186 is if that number is exactly one, 1544 00:54:18,186 --> 00:54:20,911 which is actually very rare to happen in practice. 1545 00:54:20,911 --> 00:54:22,569 That leaves us to, 1546 00:54:22,569 --> 00:54:25,229 that same intuition extends to the matrix case, 1547 00:54:25,229 --> 00:54:27,560 but now, rather than the absolute value of a scaler number, 1548 00:54:27,560 --> 00:54:29,261 you instead need to look at the largest, 1549 00:54:29,261 --> 00:54:32,071 the largest singular value of this weight matrix. 1550 00:54:32,071 --> 00:54:34,707 Now if that largest singular value is greater than one, 1551 00:54:34,707 --> 00:54:36,502 then during this backward pass, 1552 00:54:36,502 --> 00:54:38,824 when we multiply by the weight matrix over and over, 1553 00:54:38,824 --> 00:54:42,135 that gradient on h w, on h zero, sorry, 1554 00:54:42,135 --> 00:54:43,915 will become very, very large, 1555 00:54:43,915 --> 00:54:46,079 when that matrix is too large. 
1556 00:54:46,079 --> 00:54:48,947 And that's something we call the exploding gradient problem. 1557 00:54:48,947 --> 00:54:52,047 Where now this gradient will explode exponentially in depth 1558 00:54:52,047 --> 00:54:53,427 with the number of time steps 1559 00:54:53,427 --> 00:54:55,061 that we backpropagate through. 1560 00:54:55,061 --> 00:54:57,643 And if the largest singular value is less than one, 1561 00:54:57,643 --> 00:54:59,123 then we get the opposite problem, 1562 00:54:59,123 --> 00:55:00,036 where now our gradients will shrink 1563 00:55:00,036 --> 00:55:01,563 and shrink and shrink exponentially, 1564 00:55:01,563 --> 00:55:03,945 as we backpropagate and pick up more and more factors 1565 00:55:03,945 --> 00:55:05,349 of this weight matrix. 1566 00:55:05,349 --> 00:55:08,863 That's called the vanishing gradient problem. 1567 00:55:08,863 --> 00:55:11,172 THere's a bit of a hack that people sometimes do 1568 00:55:11,172 --> 00:55:12,836 to fix the exploding gradient problem 1569 00:55:12,836 --> 00:55:14,208 called gradient clipping, 1570 00:55:14,208 --> 00:55:16,264 which is just this simple heuristic 1571 00:55:16,264 --> 00:55:18,660 saying that after we compute our gradient, 1572 00:55:18,660 --> 00:55:19,713 if that gradient, 1573 00:55:19,713 --> 00:55:22,156 if it's L2 norm is above some threshold, 1574 00:55:22,156 --> 00:55:24,112 then just clamp it down and divide, 1575 00:55:24,112 --> 00:55:27,955 just clamp it down so it has this maximum threshold. 1576 00:55:27,955 --> 00:55:29,271 This is kind of a nasty hack, 1577 00:55:29,271 --> 00:55:30,964 but it actually gets used in practice quite a lot 1578 00:55:30,964 --> 00:55:32,645 when training recurrent neural networks. 1579 00:55:32,645 --> 00:55:34,517 And it's a relatively useful tool 1580 00:55:34,517 --> 00:55:39,034 for attacking this exploding gradient problem. 1581 00:55:39,034 --> 00:55:40,994 But now for the vanishing gradient problem, 1582 00:55:40,994 --> 00:55:42,254 what we typically do 1583 00:55:42,254 --> 00:55:43,640 is we might need to move to 1584 00:55:43,640 --> 00:55:46,207 a more complicated RNN architecture. 1585 00:55:46,207 --> 00:55:48,939 So that motivates this idea of an LSTM. 1586 00:55:48,939 --> 00:55:53,524 An LSTM, which stands for Long Short Term Memory, 1587 00:55:53,524 --> 00:55:55,700 is this slightly fancier recurrence relation 1588 00:55:55,700 --> 00:55:58,316 for these recurrent neural networks. 1589 00:55:58,316 --> 00:55:59,984 It's really designed to help alleviate 1590 00:55:59,984 --> 00:56:03,330 this problem of vanishing and exploding gradients. 1591 00:56:03,330 --> 00:56:05,591 So that rather than kind of hacking on top of it, 1592 00:56:05,591 --> 00:56:07,794 we just kind of design the architecture 1593 00:56:07,794 --> 00:56:10,193 to have better gradient flow properties. 1594 00:56:10,193 --> 00:56:13,566 Kind of an analogy to those fancier CNN architectures 1595 00:56:13,566 --> 00:56:16,556 that we saw at the top of the lecture. 1596 00:56:16,556 --> 00:56:17,389 Another thing to point out 1597 00:56:17,389 --> 00:56:20,807 is that the LSTM cell actually comes from 1997. 1598 00:56:20,807 --> 00:56:22,337 So this idea of an LSTM 1599 00:56:22,337 --> 00:56:24,073 has been around for quite a while, 1600 00:56:24,073 --> 00:56:26,108 and these folks were working on these ideas 1601 00:56:26,108 --> 00:56:27,110 way back in the 90s, 1602 00:56:27,110 --> 00:56:28,852 were definitely ahead of the curve. 
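A small sketch of the gradient clipping heuristic, plus a toy illustration of why it (or an LSTM) is needed: repeatedly multiplying by the transpose of the recurrence matrix blows the gradient up or shrinks it toward zero, depending on the scale of that matrix (the threshold, sizes, and scaling factors here are arbitrary choices):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        # If the L2 norm of the gradient exceeds the threshold, rescale it down to the threshold.
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    H, T = 32, 100
    W = np.random.randn(H, H) * 1.5 / np.sqrt(H)    # scaled up so repeated multiplication explodes
    g = np.random.randn(H)
    for _ in range(T):
        g = W.T @ g                                 # one W^T factor per time step (tanh ignored here)
    print(np.linalg.norm(g))                        # enormous; shrink W (e.g. by 0.3) and it vanishes instead
    print(np.linalg.norm(clip_gradient(g)))         # clipping caps the norm at 5.0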
1603 00:56:28,852 --> 00:56:31,225 These LSTM models are kind of used everywhere now, 1604 00:56:31,225 --> 00:56:32,475 20 years later. 1605 00:56:33,864 --> 00:56:37,921 And LSTMs kind of have this funny functional form. 1606 00:56:37,921 --> 00:56:39,814 So remember when we had this vanilla 1607 00:56:39,814 --> 00:56:41,198 recurrent neural network, 1608 00:56:41,198 --> 00:56:42,538 it had this hidden state. 1609 00:56:42,538 --> 00:56:44,056 And we used this recurrence relation 1610 00:56:44,056 --> 00:56:46,397 to update the hidden state at every time step. 1611 00:56:46,397 --> 00:56:47,570 Well, now in an LSTM, 1612 00:56:47,570 --> 00:56:48,842 we actually maintain 1613 00:56:48,842 --> 00:56:51,462 two hidden states at every time step. 1614 00:56:51,462 --> 00:56:52,931 One is this h t, 1615 00:56:52,931 --> 00:56:54,439 which is called the hidden state, 1616 00:56:54,439 --> 00:56:57,334 which is kind of analogous to the hidden state 1617 00:56:57,334 --> 00:56:59,305 that we had in the vanilla RNN. 1618 00:56:59,305 --> 00:57:02,524 But an LSTM also maintains a second vector, c t, 1619 00:57:02,524 --> 00:57:03,604 called the cell state. 1620 00:57:03,604 --> 00:57:06,459 And the cell state is this vector which is kind of internal, 1621 00:57:06,459 --> 00:57:08,546 kept inside the LSTM, 1622 00:57:08,546 --> 00:57:12,240 and it does not really get exposed to the outside world. 1623 00:57:12,240 --> 00:57:13,170 And we'll see, 1624 00:57:13,170 --> 00:57:15,260 you can kind of see that through these update equations, 1625 00:57:15,260 --> 00:57:17,371 where you can see that first, 1626 00:57:17,371 --> 00:57:19,235 when we compute these updates, 1627 00:57:19,235 --> 00:57:20,803 we take our two inputs, 1628 00:57:20,803 --> 00:57:23,966 we use them to compute these four gates 1629 00:57:23,966 --> 00:57:25,485 called i, f, o, and g. 1630 00:57:25,485 --> 00:57:28,634 We use those gates to update our cell state, c t, 1631 00:57:28,634 --> 00:57:30,302 and then we expose part of our cell state 1632 00:57:30,302 --> 00:57:33,802 as the hidden state at the next time step. 1633 00:57:36,704 --> 00:57:38,282 This is kind of a funny functional form, 1634 00:57:38,282 --> 00:57:40,227 and I want to walk through, over a couple of slides, 1635 00:57:40,227 --> 00:57:42,407 exactly why we use this architecture 1636 00:57:42,407 --> 00:57:44,014 and why it makes sense, 1637 00:57:44,014 --> 00:57:45,418 especially in the context 1638 00:57:45,418 --> 00:57:47,731 of vanishing or exploding gradients. 1639 00:57:47,731 --> 00:57:51,044 The first thing that we do in an LSTM 1640 00:57:51,044 --> 00:57:54,534 is that we're given this previous hidden state, h t, 1641 00:57:54,534 --> 00:57:57,213 and we're given our current input vector, x t, 1642 00:57:57,213 --> 00:57:58,611 just like in the vanilla RNN. 1643 00:57:58,611 --> 00:58:00,251 In the vanilla RNN, remember, 1644 00:58:00,251 --> 00:58:02,617 we took those two input vectors. 1645 00:58:02,617 --> 00:58:03,908 We concatenated them. 1646 00:58:03,908 --> 00:58:05,098 Then we did a matrix multiply 1647 00:58:05,098 --> 00:58:08,663 to directly compute the next hidden state in the RNN. 1648 00:58:08,663 --> 00:58:10,751 Now, the LSTM does something a little bit different.
1649 00:58:10,751 --> 00:58:13,055 We're going to take our previous hidden state 1650 00:58:13,055 --> 00:58:14,429 and our current input, 1651 00:58:14,429 --> 00:58:15,315 stack them, 1652 00:58:15,315 --> 00:58:18,748 and now multiply by a very big weight matrix, w, 1653 00:58:18,748 --> 00:58:21,926 to compute four different gates, 1654 00:58:21,926 --> 00:58:24,391 which all have the same size as the hidden state. 1655 00:58:24,391 --> 00:58:26,193 Sometimes, you'll see this written in different ways. 1656 00:58:26,193 --> 00:58:29,549 Some authors will write a different weight matrix 1657 00:58:29,549 --> 00:58:30,808 for each gate. 1658 00:58:30,808 --> 00:58:31,793 Some authors will combine them all 1659 00:58:31,793 --> 00:58:33,160 into one big weight matrix. 1660 00:58:33,160 --> 00:58:34,677 But it's all really the same thing. 1661 00:58:34,677 --> 00:58:36,334 The idea is that we take our hidden state 1662 00:58:36,334 --> 00:58:37,282 and our current input, 1663 00:58:37,282 --> 00:58:40,288 and then we use those to compute these four gates. 1664 00:58:40,288 --> 00:58:42,163 These four gates are, 1665 00:58:42,163 --> 00:58:45,725 you often see this written as i, f, o, g, ifog, 1666 00:58:45,725 --> 00:58:47,768 which makes it pretty easy to remember what they are. 1667 00:58:47,768 --> 00:58:49,435 I is the input gate. 1668 00:58:50,324 --> 00:58:53,655 It says how much we want to input into our cell. 1669 00:58:53,655 --> 00:58:55,431 F is the forget gate. 1670 00:58:55,431 --> 00:58:57,860 It says how much we want to forget the cell memory 1671 00:58:57,860 --> 00:59:00,194 from the previous time step. 1672 00:59:00,194 --> 00:59:01,333 O is the output gate, 1673 00:59:01,333 --> 00:59:03,327 which is how much we want to reveal the cell 1674 00:59:03,327 --> 00:59:04,653 to the outside world. 1675 00:59:04,653 --> 00:59:07,092 And G really doesn't have a nice name, 1676 00:59:07,092 --> 00:59:10,130 so I usually call it the gate gate. 1677 00:59:10,130 --> 00:59:13,109 G tells us how much we want to write 1678 00:59:13,109 --> 00:59:14,626 into our cell. 1679 00:59:14,626 --> 00:59:16,665 And then you notice that each of these four gates 1680 00:59:16,665 --> 00:59:19,665 is using a different non linearity. 1681 00:59:21,724 --> 00:59:23,809 The input, forget and output gates 1682 00:59:23,809 --> 00:59:25,047 are all using sigmoids, 1683 00:59:25,047 --> 00:59:28,571 which means that their values will be between zero and one. 1684 00:59:28,571 --> 00:59:31,109 Whereas the gate gate uses a tanh, 1685 00:59:31,109 --> 00:59:34,316 which means its output will be between minus one and one. 1686 00:59:34,316 --> 00:59:36,810 So, these might seem kind of weird, 1687 00:59:36,810 --> 00:59:38,551 but it makes a little bit more sense 1688 00:59:38,551 --> 00:59:41,725 if you imagine them all as binary values. 1689 00:59:41,725 --> 00:59:43,077 Right, like what happens at the extremes 1690 00:59:43,077 --> 00:59:44,744 of these two values? 1691 00:59:46,033 --> 00:59:46,866 You can see what happens, 1692 00:59:46,866 --> 00:59:48,490 after we compute these gates, 1693 00:59:48,490 --> 00:59:50,200 if you look at this next equation: 1694 00:59:50,200 --> 00:59:51,423 our cell state 1695 00:59:51,423 --> 00:59:53,926 is being multiplied element wise by the forget gate.
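As a concrete sketch of that gate computation, stack the previous hidden state with the current input, do one big matrix multiply, then apply the different nonlinearities. The function and variable names below are my own rather than the slide's notation, and x is assumed to have the same size H as the hidden state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(x_t, h_prev, W, b):
    """Compute the four LSTM gates from the stacked input [h_prev, x_t].
    x_t and h_prev each have size H; W has shape (4H, 2H), b has shape (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b  # one big matrix multiply
    i = sigmoid(z[0 * H:1 * H])   # input gate: how much to write into the cell
    f = sigmoid(z[1 * H:2 * H])   # forget gate: how much old cell memory to keep
    o = sigmoid(z[2 * H:3 * H])   # output gate: how much of the cell to reveal
    g = np.tanh(z[3 * H:4 * H])   # "gate gate": candidate values in [-1, 1]
    return i, f, o, g
```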
1696 00:59:53,926 --> 00:59:56,269 To be precise, it's the cell state from the previous time step 1697 00:59:56,269 --> 00:59:59,305 that's being multiplied element wise by this forget gate. 1698 00:59:59,305 --> 01:00:00,350 And now for this forget gate, 1699 01:00:00,350 --> 01:00:03,861 you can think of it as being a vector of zeros and ones, 1700 01:00:03,861 --> 01:00:06,162 that's telling us, for each element in the cell state, 1701 01:00:06,162 --> 01:00:08,416 do we want to forget that element of the cell, 1702 01:00:08,416 --> 01:00:10,644 in the case that the forget gate is zero, 1703 01:00:10,644 --> 01:00:12,819 or do we want to remember that element of the cell, 1704 01:00:12,819 --> 01:00:14,935 in the case that the forget gate is one. 1705 01:00:14,935 --> 01:00:17,258 Now, once we've used the forget gate 1706 01:00:17,258 --> 01:00:19,767 to gate off part of the cell state, 1707 01:00:19,767 --> 01:00:21,088 then we have the second term, 1708 01:00:21,088 --> 01:00:24,814 which is the element wise product of i and g. 1709 01:00:24,814 --> 01:00:27,431 So now, i is this vector of zeros and ones, 1710 01:00:27,431 --> 01:00:28,824 because it's coming through a sigmoid, 1711 01:00:28,824 --> 01:00:31,480 telling us, for each element of the cell state, 1712 01:00:31,480 --> 01:00:33,909 do we want to write to that element of the cell state, 1713 01:00:33,909 --> 01:00:35,728 in the case that i is one, 1714 01:00:35,728 --> 01:00:37,771 or do we not want to write to that element of the cell state 1715 01:00:37,771 --> 01:00:38,687 at this time step, 1716 01:00:38,687 --> 01:00:41,719 in the case that i is zero. 1717 01:00:41,719 --> 01:00:42,585 And now the gate gate, 1718 01:00:42,585 --> 01:00:44,254 because it's coming through a tanh, 1719 01:00:44,254 --> 01:00:46,031 will be either one or minus one at the extremes. 1720 01:00:46,031 --> 01:00:47,495 So that is the candidate value 1721 01:00:47,495 --> 01:00:50,500 that we might consider writing 1722 01:00:50,500 --> 01:00:54,022 to each element of the cell state at this time step. 1723 01:00:54,022 --> 01:00:56,098 Then if you look at the cell state equation, 1724 01:00:56,098 --> 01:00:58,157 you can see that at every time step, 1725 01:00:58,157 --> 01:00:59,963 the cell state has these 1726 01:00:59,963 --> 01:01:02,499 different, independent scalar values, 1727 01:01:02,499 --> 01:01:05,230 and they're all being incremented or decremented by up to one. 1728 01:01:05,230 --> 01:01:07,466 So, kind of, 1729 01:01:07,466 --> 01:01:09,560 inside the cell state, we can either remember 1730 01:01:09,560 --> 01:01:11,028 or forget our previous state, 1731 01:01:11,028 --> 01:01:13,480 and then we can either increment or decrement 1732 01:01:13,480 --> 01:01:14,765 each element of that cell state 1733 01:01:14,765 --> 01:01:16,535 by up to one at each time step. 1734 01:01:16,535 --> 01:01:19,430 So you can kind of think of these elements of the cell state 1735 01:01:19,430 --> 01:01:22,189 as being little scalar integer counters 1736 01:01:22,189 --> 01:01:24,061 that can be incremented and decremented 1737 01:01:24,061 --> 01:01:25,708 at each time step. 1738 01:01:25,708 --> 01:01:28,489 And now, after we've computed our cell state, 1739 01:01:28,489 --> 01:01:31,624 then we use our now updated cell state 1740 01:01:31,624 --> 01:01:32,874 to compute a hidden state, 1741 01:01:32,874 --> 01:01:36,513 which we will reveal to the outside world.
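Putting the cell state update together with the hidden state computation that the next few sentences walk through, one full LSTM step might look like the following sketch, reusing numpy and the hypothetical lstm_gates helper from the block above:

```python
def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step (sketch, building on lstm_gates above)."""
    i, f, o, g = lstm_gates(x_t, h_prev, W, b)
    c_t = f * c_prev + i * g    # forget/keep the old cell, then write up to +/- 1
    h_t = o * np.tanh(c_t)      # squash the cell and gate what gets revealed
    return h_t, c_t
```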
1742 01:01:36,513 --> 01:01:38,832 So because this cell state has this interpretation 1743 01:01:38,832 --> 01:01:39,884 of being counters, 1744 01:01:39,884 --> 01:01:41,280 and sort of counting up by one 1745 01:01:41,280 --> 01:01:43,353 or minus one at each time step, 1746 01:01:43,353 --> 01:01:45,518 we want to squash that counter value 1747 01:01:45,518 --> 01:01:48,895 into a nice minus one to one range using a tanh. 1748 01:01:48,895 --> 01:01:50,483 And then we multiply element wise 1749 01:01:50,483 --> 01:01:51,826 by this output gate. 1750 01:01:51,826 --> 01:01:54,441 And the output gate is again coming through a sigmoid, 1751 01:01:54,441 --> 01:01:57,597 so you can think of it as being mostly zeros and ones, 1752 01:01:57,597 --> 01:01:58,947 and the output gate tells us, 1753 01:01:58,947 --> 01:02:00,826 for each element of our cell state, 1754 01:02:00,826 --> 01:02:02,773 do we want to reveal or not reveal 1755 01:02:02,773 --> 01:02:04,614 that element of our cell state 1756 01:02:04,614 --> 01:02:06,994 when we're computing the external hidden state 1757 01:02:06,994 --> 01:02:08,577 for this time step. 1758 01:02:09,736 --> 01:02:11,609 And then, I think there's kind of a tradition 1759 01:02:11,609 --> 01:02:13,479 among people trying to explain LSTMs 1760 01:02:13,479 --> 01:02:14,524 that everyone needs to come up 1761 01:02:14,524 --> 01:02:17,132 with their own potentially confusing LSTM diagram. 1762 01:02:17,132 --> 01:02:18,882 So here's my attempt. 1763 01:02:20,380 --> 01:02:23,866 Here, we can see what's going on inside this LSTM cell. 1764 01:02:23,866 --> 01:02:24,865 We're taking as input, on the left, 1765 01:02:24,865 --> 01:02:27,566 our previous cell state 1766 01:02:27,566 --> 01:02:28,937 and the previous hidden state, 1767 01:02:28,937 --> 01:02:31,266 as well as our current input, x t. 1768 01:02:31,266 --> 01:02:32,537 Now we're going to take 1769 01:02:32,537 --> 01:02:34,985 our previous hidden state, 1770 01:02:34,985 --> 01:02:36,453 as well as our current input, 1771 01:02:36,453 --> 01:02:37,346 stack them, 1772 01:02:37,346 --> 01:02:39,526 and then multiply with this weight matrix, w, 1773 01:02:39,526 --> 01:02:41,166 to produce our four gates. 1774 01:02:41,166 --> 01:02:42,627 And here, I've left out the non linearities 1775 01:02:42,627 --> 01:02:44,836 because we saw those on a previous slide. 1776 01:02:44,836 --> 01:02:47,294 And now the forget gate multiplies element wise 1777 01:02:47,294 --> 01:02:48,143 with the cell state. 1778 01:02:48,143 --> 01:02:51,174 The input and gate gates are multiplied element wise 1779 01:02:51,174 --> 01:02:52,689 and added to the cell state. 1780 01:02:52,689 --> 01:02:54,524 And that gives us our next cell state. 1781 01:02:54,524 --> 01:02:56,533 The next cell state gets squashed through a tanh, 1782 01:02:56,533 --> 01:02:58,616 and multiplied element wise with this output gate 1783 01:02:58,616 --> 01:03:01,366 to produce our next hidden state. 1784 01:03:02,417 --> 01:03:03,250 Question? 1785 01:03:13,116 --> 01:03:14,587 No, they're coming from 1786 01:03:14,587 --> 01:03:17,878 different parts of this weight matrix.
1787 01:03:17,878 --> 01:03:19,147 So if our x and our h 1788 01:03:19,147 --> 01:03:23,343 both have this dimension h, 1789 01:03:23,343 --> 01:03:24,300 then after we stack them, 1790 01:03:24,300 --> 01:03:26,415 we'll have a vector of size two h, 1791 01:03:26,415 --> 01:03:28,718 and now our weight matrix will be this matrix 1792 01:03:28,718 --> 01:03:30,393 of size four h by two h. 1793 01:03:30,393 --> 01:03:31,873 So you can think of that as sort of having 1794 01:03:31,873 --> 01:03:34,076 four chunks of this weight matrix. 1795 01:03:34,076 --> 01:03:37,344 And each of these four chunks of the weight matrix 1796 01:03:37,344 --> 01:03:41,511 is going to compute a different one of these gates. 1797 01:03:42,404 --> 01:03:44,673 You'll often see this written, for clarity, 1798 01:03:44,673 --> 01:03:46,449 kind of combining all four of those different 1799 01:03:46,449 --> 01:03:49,161 weight matrices into a single large matrix, w, 1800 01:03:49,161 --> 01:03:51,109 just for notational convenience. 1801 01:03:51,109 --> 01:03:52,484 But they're all computed 1802 01:03:52,484 --> 01:03:56,067 using different parts of the weight matrix. 1803 01:03:57,080 --> 01:03:58,645 But you're correct in that they're all computed 1804 01:03:58,645 --> 01:04:00,458 using the same functional form 1805 01:04:00,458 --> 01:04:01,658 of just stacking the two things 1806 01:04:01,658 --> 01:04:04,574 and doing a matrix multiplication. 1807 01:04:04,574 --> 01:04:06,393 Now that we have this picture, 1808 01:04:06,393 --> 01:04:09,519 we can think about what happens to an LSTM cell 1809 01:04:09,519 --> 01:04:11,196 during the backwards pass. 1810 01:04:11,196 --> 01:04:12,795 We saw, in the context of the vanilla 1811 01:04:12,795 --> 01:04:13,738 recurrent neural network, 1812 01:04:13,738 --> 01:04:15,803 that some bad things happened during the backwards pass, 1813 01:04:15,803 --> 01:04:16,958 where we were continually multiplying 1814 01:04:16,958 --> 01:04:18,999 by that weight matrix, w. 1815 01:04:18,999 --> 01:04:20,555 But now, the situation looks 1816 01:04:20,555 --> 01:04:23,238 quite a bit different in the LSTM. 1817 01:04:23,238 --> 01:04:26,468 If you imagine this path backwards 1818 01:04:26,468 --> 01:04:28,323 of computing the gradients of the cell state, 1819 01:04:28,323 --> 01:04:29,952 we get quite a nice picture. 1820 01:04:29,952 --> 01:04:32,266 Now, when we have our upstream gradient 1821 01:04:32,266 --> 01:04:33,320 from the cell coming in, 1822 01:04:33,320 --> 01:04:36,281 then once we backpropagate backwards 1823 01:04:36,281 --> 01:04:37,737 through this addition operation, 1824 01:04:37,737 --> 01:04:40,572 remember that this addition just copies 1825 01:04:40,572 --> 01:04:43,663 that upstream gradient into the two branches, 1826 01:04:43,663 --> 01:04:45,641 so our upstream gradient gets copied directly 1827 01:04:45,641 --> 01:04:47,954 and passed directly to backpropagating 1828 01:04:47,954 --> 01:04:50,435 through this element wise multiply. 1829 01:04:50,435 --> 01:04:52,192 So then our upstream gradient ends up getting 1830 01:04:52,192 --> 01:04:56,455 multiplied element wise by the forget gate.
1831 01:04:56,455 --> 01:05:00,364 As we backpropagate backwards through this cell state, 1832 01:05:00,364 --> 01:05:01,397 the only thing that happens 1833 01:05:01,397 --> 01:05:03,818 to our upstream cell state gradient 1834 01:05:03,818 --> 01:05:05,935 is that it ends up getting multiplied element wise 1835 01:05:05,935 --> 01:05:07,171 by the forget gate. 1836 01:05:07,171 --> 01:05:09,939 This is really a lot nicer 1837 01:05:09,939 --> 01:05:12,640 than the vanilla RNN for two reasons. 1838 01:05:12,640 --> 01:05:14,318 One is that this forget gate 1839 01:05:14,318 --> 01:05:16,488 is now an element wise multiplication 1840 01:05:16,488 --> 01:05:18,498 rather than a full matrix multiplication. 1841 01:05:18,498 --> 01:05:19,923 So element wise multiplication 1842 01:05:19,923 --> 01:05:23,205 is going to be a little bit nicer 1843 01:05:23,205 --> 01:05:24,964 than full matrix multiplication. 1844 01:05:24,964 --> 01:05:27,208 Second is that this element wise multiplication 1845 01:05:27,208 --> 01:05:29,710 will potentially be multiplying by a different 1846 01:05:29,710 --> 01:05:31,354 forget gate at every time step. 1847 01:05:31,354 --> 01:05:33,087 So remember, in the vanilla RNN, 1848 01:05:33,087 --> 01:05:35,638 we were continually multiplying by that same weight matrix 1849 01:05:35,638 --> 01:05:36,660 over and over again, 1850 01:05:36,660 --> 01:05:38,305 which led very explicitly 1851 01:05:38,305 --> 01:05:40,563 to these exploding or vanishing gradients. 1852 01:05:40,563 --> 01:05:41,943 But now in the LSTM case, 1853 01:05:41,943 --> 01:05:45,161 this forget gate can vary at each time step. 1854 01:05:45,161 --> 01:05:47,463 Now, it's much easier for the model 1855 01:05:47,463 --> 01:05:49,560 to avoid these problems 1856 01:05:49,560 --> 01:05:51,670 of exploding and vanishing gradients. 1857 01:05:51,670 --> 01:05:53,377 Finally, because this forget gate 1858 01:05:53,377 --> 01:05:54,902 is coming out of a sigmoid, 1859 01:05:54,902 --> 01:05:56,178 this element wise multiply 1860 01:05:56,178 --> 01:05:58,438 is guaranteed to be between zero and one, 1861 01:05:58,438 --> 01:06:00,868 which again leads to sort of nicer numerical properties 1862 01:06:00,868 --> 01:06:02,457 if you imagine multiplying by these things 1863 01:06:02,457 --> 01:06:04,278 over and over again. 1864 01:06:04,278 --> 01:06:07,063 Another thing to notice is that in the context 1865 01:06:07,063 --> 01:06:08,909 of the vanilla recurrent neural network, 1866 01:06:08,909 --> 01:06:10,239 we saw that during the backward pass, 1867 01:06:10,239 --> 01:06:13,146 our gradients were also flowing through a tanh 1868 01:06:13,146 --> 01:06:14,273 at every time step. 1869 01:06:14,273 --> 01:06:15,940 But now, in an LSTM, 1870 01:06:17,459 --> 01:06:18,792 our hidden state is used 1871 01:06:20,790 --> 01:06:22,452 to compute those outputs, y t. 1872 01:06:22,452 --> 01:06:24,110 So now, 1873 01:06:24,110 --> 01:06:26,141 if you imagine backpropagating 1874 01:06:26,141 --> 01:06:29,229 from the final hidden state 1875 01:06:29,229 --> 01:06:31,290 back to the first cell state, 1876 01:06:31,290 --> 01:06:33,024 then through that backward path, 1877 01:06:33,024 --> 01:06:37,396 we only backpropagate through a single tanh non linearity 1878 01:06:37,396 --> 01:06:42,044 rather than through a separate tanh at every time step.
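Here is a minimal sketch of that point: backpropagating the cell state gradient through many time steps only picks up an element wise multiplication by each step's forget gate, and those forget gates can differ from step to step. The names below are mine, not the slide's:

```python
def cell_gradient_through_time(dc_last, forget_gates):
    """Backprop the cell-state gradient from the last time step to the first.
    Each step multiplies element wise by that step's forget gate (values in
    (0, 1)), rather than by the same full weight matrix every time."""
    dc = dc_last
    for f in reversed(forget_gates):  # walk backwards through time
        dc = dc * f
    return dc
```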
1879 01:06:42,044 --> 01:06:44,059 So kind of when you put all these things together, 1880 01:06:44,059 --> 01:06:45,927 you can see that this backwards pass, 1881 01:06:45,927 --> 01:06:48,646 backpropagating through the cell state, 1882 01:06:48,646 --> 01:06:50,483 is kind of a gradient super highway 1883 01:06:50,483 --> 01:06:53,205 that lets gradients pass relatively unimpeded 1884 01:06:53,205 --> 01:06:55,047 from the loss at the very end of the model 1885 01:06:55,047 --> 01:06:56,765 all the way back to the initial cell state 1886 01:06:56,765 --> 01:06:59,172 at the beginning of the model. 1887 01:06:59,172 --> 01:07:00,922 Was there a question? 1888 01:07:02,901 --> 01:07:04,693 Yeah, what about the gradient with respect to w? 1889 01:07:04,693 --> 01:07:06,792 'Cause that's ultimately the thing that we care about. 1890 01:07:06,792 --> 01:07:08,752 So, the gradient with respect to w 1891 01:07:08,752 --> 01:07:10,252 will come through, 1892 01:07:11,737 --> 01:07:12,570 at every time step, 1893 01:07:12,570 --> 01:07:13,771 our current cell state 1894 01:07:13,771 --> 01:07:15,059 as well as our current hidden state, 1895 01:07:15,059 --> 01:07:16,272 and that will give us 1896 01:07:16,272 --> 01:07:18,480 our local gradient on w 1897 01:07:18,480 --> 01:07:19,848 for that time step. 1898 01:07:19,848 --> 01:07:21,340 So, 1899 01:07:21,340 --> 01:07:23,791 just as in the vanilla RNN case, 1900 01:07:23,791 --> 01:07:27,307 we'll end up summing those per time step w gradients 1901 01:07:27,307 --> 01:07:29,587 to compute our final gradient on w. 1902 01:07:29,587 --> 01:07:33,139 But now, if you imagine the situation 1903 01:07:33,139 --> 01:07:34,961 where we have a very long sequence, 1904 01:07:34,961 --> 01:07:36,405 and we're only getting gradients at the very end 1905 01:07:36,405 --> 01:07:37,280 of the sequence, 1906 01:07:37,280 --> 01:07:38,945 then as you backpropagate through, 1907 01:07:38,945 --> 01:07:40,978 we'll get a local gradient on w 1908 01:07:40,978 --> 01:07:43,219 for each time step, 1909 01:07:43,219 --> 01:07:44,484 and that local gradient on w 1910 01:07:44,484 --> 01:07:48,506 will be coming through these gradients on c and h. 1911 01:07:48,506 --> 01:07:50,994 So because we're maintaining the gradients on c 1912 01:07:50,994 --> 01:07:52,751 much more nicely in the LSTM case, 1913 01:07:52,751 --> 01:07:54,988 those local gradients on w at each time step 1914 01:07:54,988 --> 01:07:57,221 will also be carried forward and backward 1915 01:07:57,221 --> 01:07:59,804 through time much more cleanly. 1916 01:08:01,627 --> 01:08:03,044 Another question? 1917 01:08:17,428 --> 01:08:18,645 Yeah, so the question is, 1918 01:08:18,645 --> 01:08:19,886 due to the non linearities, 1919 01:08:19,886 --> 01:08:22,088 could this still be susceptible to vanishing gradients? 1920 01:08:22,089 --> 01:08:24,077 And that could be the case. 1921 01:08:24,077 --> 01:08:26,176 Actually, one problem you might imagine 1922 01:08:26,176 --> 01:08:27,560 is that if these forget gates 1923 01:08:27,560 --> 01:08:29,322 are always somewhat 1924 01:08:29,323 --> 01:08:30,252 less than one, 1925 01:08:30,252 --> 01:08:31,411 you might get vanishing gradients 1926 01:08:31,411 --> 01:08:34,103 as you continually go through these forget gates.
1927 01:08:34,103 --> 01:08:35,960 Well, one sort of trick that people do in practice 1928 01:08:35,960 --> 01:08:38,513 is that they will, sometimes, 1929 01:08:38,513 --> 01:08:40,689 initialize the biases of the forget gate 1930 01:08:40,689 --> 01:08:42,746 to be somewhat positive, 1931 01:08:42,746 --> 01:08:44,004 so that at the beginning of training, 1932 01:08:44,004 --> 01:08:46,305 those forget gates are always very close to one. 1933 01:08:46,305 --> 01:08:48,118 So at least at the beginning of training, 1934 01:08:48,118 --> 01:08:52,285 we have relatively clean gradient flow 1935 01:08:53,265 --> 01:08:54,381 through these forget gates, 1936 01:08:54,381 --> 01:08:56,631 since they're all initialized to be near one. 1937 01:08:56,631 --> 01:08:58,934 And then throughout the course of training, 1938 01:08:58,935 --> 01:09:00,308 the model can learn those biases 1939 01:09:00,308 --> 01:09:02,742 and kind of learn to forget where it needs to. 1940 01:09:02,742 --> 01:09:04,886 You're right that there still could be some potential 1941 01:09:04,886 --> 01:09:06,246 for vanishing gradients here. 1942 01:09:06,246 --> 01:09:07,569 But it's much less extreme 1943 01:09:07,569 --> 01:09:09,182 than in the vanilla RNN case, 1944 01:09:09,182 --> 01:09:12,126 both because those fs can vary at each time step, 1945 01:09:12,126 --> 01:09:14,590 and also because we're doing 1946 01:09:14,590 --> 01:09:15,620 this element wise multiplication 1947 01:09:15,620 --> 01:09:19,126 rather than a full matrix multiplication. 1948 01:09:19,126 --> 01:09:20,801 So you can see that this LSTM 1949 01:09:20,801 --> 01:09:23,048 actually looks quite similar to ResNet. 1950 01:09:23,048 --> 01:09:24,510 In the residual network, 1951 01:09:24,510 --> 01:09:26,732 we had this path of identity connections 1952 01:09:26,732 --> 01:09:28,069 going backward through the network, 1953 01:09:28,069 --> 01:09:30,362 and that gave sort of a gradient super highway 1954 01:09:30,363 --> 01:09:32,303 for gradients to flow backward in ResNet. 1955 01:09:32,303 --> 01:09:34,924 And now it's kind of the same intuition in the LSTM, 1956 01:09:34,924 --> 01:09:37,944 where these additive and element wise 1957 01:09:37,944 --> 01:09:40,226 multiplicative interactions of the cell state 1958 01:09:40,227 --> 01:09:42,305 can give a similar gradient super highway 1959 01:09:42,305 --> 01:09:44,328 for gradients to flow backwards through the cell state 1960 01:09:44,328 --> 01:09:45,245 in an LSTM. 1961 01:09:46,343 --> 01:09:48,593 And by the way, there's this other kind of nice paper 1962 01:09:48,593 --> 01:09:49,682 called highway networks, 1963 01:09:49,682 --> 01:09:51,808 which is kind of in between this idea 1964 01:09:51,808 --> 01:09:53,225 of the LSTM cell 1965 01:09:54,138 --> 01:09:56,471 and these residual networks.
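Going back to that initialization trick for a moment: a minimal sketch, assuming the same [i, f, o, g] bias layout as the earlier lstm_gates sketch, might look like this. The value 1.0 is just a commonly used example; with a bias of 1, the forget gates start near sigmoid(1), roughly 0.73:

```python
import numpy as np

def init_lstm_bias(H, forget_bias=1.0):
    """Initialize the (4H,) LSTM bias with a positive forget-gate chunk so the
    forget gates start close to one and gradients flow cleanly early on."""
    b = np.zeros(4 * H)
    b[H:2 * H] = forget_bias  # the forget-gate slice in the [i, f, o, g] layout
    return b
```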
1966 01:09:57,796 --> 01:09:58,697 Coming back to highway networks: 1967 01:09:58,697 --> 01:10:01,165 they actually came before residual networks, 1968 01:10:01,165 --> 01:10:03,008 and they had this idea where, 1969 01:10:03,008 --> 01:10:04,541 at every layer of the highway network, 1970 01:10:04,541 --> 01:10:05,984 we're going to compute 1971 01:10:05,984 --> 01:10:07,373 sort of a candidate activation, 1972 01:10:07,373 --> 01:10:08,804 as well as a gating function 1973 01:10:08,804 --> 01:10:10,792 that interpolates 1974 01:10:10,792 --> 01:10:13,045 between our previous input at that layer 1975 01:10:13,045 --> 01:10:14,676 and that candidate activation 1976 01:10:14,676 --> 01:10:17,017 that came through our convolutions or whatnot. 1977 01:10:17,017 --> 01:10:19,455 So there's actually a lot of architectural similarities 1978 01:10:19,455 --> 01:10:20,472 between these things, 1979 01:10:20,472 --> 01:10:21,821 and people take a lot of inspiration 1980 01:10:21,821 --> 01:10:24,363 from training very deep CNNs 1981 01:10:24,363 --> 01:10:25,196 and very deep RNNs, 1982 01:10:25,196 --> 01:10:27,688 and there's a lot of crossover here. 1983 01:10:27,688 --> 01:10:31,682 Very briefly, you'll see a lot of other variants 1984 01:10:31,682 --> 01:10:33,777 of recurrent neural network architectures out there 1985 01:10:33,777 --> 01:10:34,675 in the wild. 1986 01:10:34,675 --> 01:10:36,760 Probably the most common, apart from the LSTM, 1987 01:10:36,760 --> 01:10:40,853 is the GRU, the gated recurrent unit. 1988 01:10:40,853 --> 01:10:42,665 And you can see its update equations here, 1989 01:10:42,665 --> 01:10:45,701 and it kind of has this similar flavor to the LSTM, 1990 01:10:45,701 --> 01:10:49,318 where it uses these multiplicative element wise gates 1991 01:10:49,318 --> 01:10:51,417 together with these additive interactions 1992 01:10:51,417 --> 01:10:53,828 to avoid this vanishing gradient problem. 1993 01:10:53,828 --> 01:10:55,584 There's also this cool paper 1994 01:10:55,584 --> 01:10:57,679 called LSTM: A Search Space Odyssey, 1995 01:10:57,679 --> 01:10:59,734 very inventive title, 1996 01:10:59,734 --> 01:11:02,853 where they tried to play around with the LSTM equations 1997 01:11:02,853 --> 01:11:04,980 and swap out the non linearities, 1998 01:11:04,980 --> 01:11:06,329 like, do we really need that tanh 1999 01:11:06,329 --> 01:11:07,343 when exposing the cell through the output gate, 2000 01:11:07,343 --> 01:11:09,756 and they tried to answer a lot of these different questions 2001 01:11:09,756 --> 01:11:11,367 about each of those non linearities, 2002 01:11:11,367 --> 01:11:14,017 each of those pieces of the LSTM update equations: 2003 01:11:14,017 --> 01:11:15,374 what happens if we change the model 2004 01:11:15,374 --> 01:11:18,068 and tweak those LSTM equations a little bit? 2005 01:11:18,068 --> 01:11:19,038 And kind of the conclusion is 2006 01:11:19,038 --> 01:11:20,260 that they all work about the same. 2007 01:11:20,260 --> 01:11:22,414 Some of them work a little bit better than others 2008 01:11:22,414 --> 01:11:24,521 for one problem or another. 2009 01:11:24,521 --> 01:11:25,735 But generally, none of the things, 2010 01:11:25,735 --> 01:11:28,036 none of the tweaks of the LSTM that they tried, 2011 01:11:28,036 --> 01:11:30,911 were significantly better than the original LSTM 2012 01:11:30,911 --> 01:11:32,148 for all problems.
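Before moving on, here is a minimal sketch of the GRU update mentioned above, written in one common formulation; the weight and gate names are mine, and the exact notation on the slide may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: an update gate z and reset gate r (both sigmoids), plus
    an additive interpolation in place of the LSTM's separate cell state."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx + bz)  # update gate
    r = sigmoid(Wr @ hx + br)  # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate
    return (1 - z) * h_prev + z * h_tilde  # interpolate old and new state
```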
2013 01:11:32,148 --> 01:11:33,647 So that result gives you a little bit more faith 2014 01:11:33,647 --> 01:11:36,505 that even though the LSTM update equations seem kind of magical, 2015 01:11:36,505 --> 01:11:38,889 they're a reasonable thing to use anyway, 2016 01:11:38,889 --> 01:11:41,084 and you should probably consider them for your problem. 2017 01:11:41,084 --> 01:11:43,692 There's also this cool paper from Google a couple years ago 2018 01:11:43,692 --> 01:11:44,575 where they did 2019 01:11:44,575 --> 01:11:46,520 kind of an evolutionary search 2020 01:11:46,520 --> 01:11:48,117 over a very large number 2021 01:11:48,117 --> 01:11:52,605 of random RNN architectures. 2022 01:11:52,605 --> 01:11:55,694 They kind of randomly permute these update equations 2023 01:11:55,694 --> 01:11:57,936 and try putting the additions and the multiplications 2024 01:11:57,936 --> 01:11:59,332 and the gates and the non linearities 2025 01:11:59,332 --> 01:12:00,860 in different kinds of combinations. 2026 01:12:00,860 --> 01:12:03,288 They blasted this out over their huge Google cluster 2027 01:12:03,288 --> 01:12:04,351 and just tried a whole bunch 2028 01:12:04,351 --> 01:12:08,570 of these different update equations in various flavors. 2029 01:12:08,570 --> 01:12:09,742 And again, it was the same story: 2030 01:12:09,742 --> 01:12:11,299 they didn't really find anything 2031 01:12:11,299 --> 01:12:12,607 that was significantly better 2032 01:12:12,607 --> 01:12:15,446 than these existing GRU or LSTM styles, 2033 01:12:15,446 --> 01:12:16,978 although there were some variations that worked 2034 01:12:16,978 --> 01:12:19,518 maybe slightly better or worse for certain problems. 2035 01:12:19,518 --> 01:12:21,304 But kind of the take away is that, 2036 01:12:21,304 --> 01:12:24,252 when using an LSTM or GRU, 2037 01:12:24,252 --> 01:12:27,080 there's probably not so much magic in those exact equations, 2038 01:12:27,080 --> 01:12:29,292 but this idea of managing gradient flow properly 2039 01:12:29,292 --> 01:12:30,633 through these additive connections 2040 01:12:30,633 --> 01:12:32,023 and these multiplicative gates 2041 01:12:32,023 --> 01:12:33,356 is super useful. 2042 01:12:34,888 --> 01:12:37,286 So yeah, the summary is that RNNs are super cool. 2043 01:12:37,286 --> 01:12:40,103 They can allow you to attack tons of new types of problems. 2044 01:12:40,103 --> 01:12:42,024 They are sometimes susceptible to vanishing 2045 01:12:42,024 --> 01:12:43,431 or exploding gradients. 2046 01:12:43,431 --> 01:12:44,740 But we can address that with gradient clipping 2047 01:12:44,740 --> 01:12:47,412 and with fancier architectures. 2048 01:12:47,412 --> 01:12:48,899 And there's a lot of cool overlap 2049 01:12:48,899 --> 01:12:52,043 between CNN architectures and RNN architectures. 2050 01:12:52,043 --> 01:12:53,856 So next time, you'll be taking the midterm. 2051 01:12:53,856 --> 01:12:57,856 But after that, we'll have a, sorry, a question? 2052 01:12:59,525 --> 01:13:00,801 The midterm is after this lecture, 2053 01:13:00,801 --> 01:13:04,142 so anything up to this point is fair game.